A Tale of Servers, Chat and the Technical Future of ARRSE

Discussion in 'ARRSE: Site Issues' started by Good CO, Mar 20, 2012.


  1. Good CO

    Good CO LE Admin

    Bad CO asked me to shed some light on why I've been taking the site offline here and there, why we had a long painful series of service issues over the last month, and what we've done and are doing about it.

    The site's demands are continually growing. We still have all the text content and most of the pictures going back to 2001, and the rate of growth keeps increasing. We have steadily improved the software that runs the site, but that brings associated increases in hardware power, cost and complexity.

    To keep up with this we had grown a hardware nightmare. Fine for big companies with IT departments, but not so good for us (an IT department of me and a part-time sub-contractor). It was taking up too much time and money. Multiple servers reduced overall down-time, our backup system was really bulletproof (we still have everything back to 2001), and our emails were getting through spam filters better, but faults were creeping in and finding them was increasingly difficult.

    As my pretty picture shows, in total the site was running on five servers (i.e. separate big beasty computers) with a sixth spare. By comparison, in 2003 ARRSE was running on one.

    All of these needed updates applying and routine problems attending to. Add a separate backup service and at one point an outsourced email sending service and it was an expensive nightmare.


    A month or so ago we started getting the irritating database errors. I never did get to the bottom of that although it prompted a major spend on hardware.

    Then a few weeks ago we started getting massive server loads and the site kept dying. We couldn't work out why. Finally, after ages of hunting in the wrong place, we realised it was the chat system, Flashchat. It was creating lots of connections to our server and then sitting there doing nothing. Eventually the server keeled over and died (highlighting along the way a config error which should have limited this effect).

    Flashchat has been unsupported for a couple of years now, and the integration with vBulletin, i.e. our site, didn't work without hacking by me. I don't know where the error was, but as evening users will have noticed, it has gone (a new system will be in shortly, but I'm not sure which yet).
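
    For anyone curious, the sort of check that would have flagged this sooner is simply counting established connections to the web port grouped by remote address: a few clients quietly holding hundreds of idle connections is the classic signature. A minimal sketch in Python (psutil and port 80 are my assumptions here, not our exact setup):

```python
# Count established TCP connections to the web port, grouped by remote IP.
# A handful of addresses holding very large counts suggests something like
# Flashchat opening connections and then sitting on them doing nothing.
# Needs psutil installed; may need root to see other processes' sockets.
from collections import Counter

import psutil

WEB_PORT = 80  # hypothetical listening port

counts = Counter()
for conn in psutil.net_connections(kind="tcp"):
    if conn.status == psutil.CONN_ESTABLISHED and conn.laddr and conn.laddr.port == WEB_PORT:
        counts[conn.raddr.ip if conn.raddr else "unknown"] += 1

for ip, n in counts.most_common(10):
    print(f"{ip}: {n} open connections")
```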


    This all prompted a move that we perhaps should have made years ago: a move into 'the cloud'. What this means in effect is that we no longer know what physical hardware we are using, and a lot of the maintenance, upgrading and configuration complexity is done for us. We can also change the size and performance of machines easily, and it is fast and simple to bring new servers online in case of increased demand or hardware failure. Backups are easier too.
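
    To give a feel for what "bringing a new server online" means in practice, here's a minimal sketch using Amazon's Python SDK (boto3). The image ID, instance type and region below are placeholders, not our real configuration:

```python
# Launch one extra web server from a pre-built machine image.
# ImageId, InstanceType and region are illustrative placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",   # hypothetical image with our stack baked in
    InstanceType="m1.large",  # can be resized later without rebuilding
    MinCount=1,
    MaxCount=1,
)
print("Launched", response["Instances"][0]["InstanceId"])
```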

    So at the weekend ARRSE moved to Amazon's cloud infrastructure, and it has been flying along since.

    We are still in a test period. At the moment we are paying by the hour, at a suitably extreme rate, for a simplified and part-managed cloud version of my whiteboard sketch. Once we're settled and know what we need from Amazon, our rates will come down as we move on to contracts, and I am confident that our demands will be affordable on Amazon, although more expensive than our old hosting bills.
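
    Purely as an illustration of why the hourly rate matters (these are made-up numbers, not Amazon's prices or ours), the shape of the sums is roughly this:

```python
# Made-up rates, just to show the on-demand vs contract trade-off.
HOURS_PER_MONTH = 730

on_demand_rate = 0.50       # hypothetical $/hour while we test
reserved_upfront = 1000.00  # hypothetical one-off fee for a one-year contract
reserved_rate = 0.20        # hypothetical reduced $/hour under that contract

on_demand_monthly = on_demand_rate * HOURS_PER_MONTH
reserved_monthly = reserved_rate * HOURS_PER_MONTH + reserved_upfront / 12

print(f"On-demand: ${on_demand_monthly:.0f}/month")  # 365
print(f"Contract:  ${reserved_monthly:.0f}/month")   # about 229
```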

    We've set the 10th of April as the decision point for whether to stay with Amazon, what contracts to take on, or whether to move house again. In the meantime there will be the odd bit of downtime while we test and adjust.

    I don't expect that was of much interest to many! But, heh, it got me away from staring at server usage statistics for a while and I did get to show my artistic talent.
    • Like x 12
  2. Makes perfect sense to me.

    Say again all after "Bad CO ....". :)
    • Like x 6
  3. blue-sophist

    blue-sophist LE Good Egg (charities)

    That's both helpful and comprehensible!! "Thanks for sharing", as they say.

    Sorry you're faced with bigger charges, both now and in the future. But I guess you get what you pay for.

    Good luck :thumleft:
    • Like x 1
  4. Thanks for the update; I'm sure your hard work is appreciated by all.
    • Like x 2
  5. maninblack

    maninblack LE Book Reviewer

    I sort of understood that, I'll run it by a 12 year old and get back to you.
    • Like x 3
  6. Why don't you have two ESXi boxes running as bare-metal hypervisors? You can then have all the virtual servers you need on two separate bits of tin, giving you high availability, VMotion for fail-over, and redundancy/disaster recovery etc.

    You could also then clone them for a test rig.
    • Like x 1
  7. Auld-Yin

    Auld-Yin LE Reviewer Book Reviewer Reviews Editor

    I think my PC has come out in sympathy with the ARRSE servers as it is giving me a huge fecking headache at present.

    Thanks for the update and I hope it is the success you intend. The downtime has meant I have had to spend some time with sensible people which is not good for my peace of mind.
  8. Good to let us know what's happening, as it's much easier to live with outages when you know what's behind them. Perhaps worth making this info a little more visible?
  9. We share your pain but not your workload. Thanks for telling us.
  10. TheIronDuke

    TheIronDuke LE Book Reviewer

    Or Amazon Elastic Compute Cloud, to give it its full title. EC2 to its mates. I know people who swear by it, particularly its ability to deal with sudden spikes at 3.00am. If you are stuck for bodies you could try Amazon Mechanical Turk, MTurk to its mates, which appeared about the same time as EC2. Clearly a time when the Amazon execs responsible for naming things were experimenting with psilocybin.

    No drugs were employed in the production of Good CO's lovely picture. I wish to make that clear.
  11. When Dale decides to log on after a 15-hour boozing sesh?
    • Like x 2
  12. :wink:
    • Like x 2
  13. Good CO

    Good CO LE Admin

    Putting everything on the same physical machine, times two, isn't the cure we need. What's important is simplicity, security and lack of maintenance. Also, I doubt one server would run our database, web and search servers without huge hardware expense, so we're back into the realms of load balancing and synchronisation between the two, and the need for a third in case of failure. Amazon does the load balancing for us and keeps a mirror copy of the DB, with an automatic switch-over in case of failure, on dedicated DB servers which I don't need to maintain. The maintenance is therefore two Apache servers and one Sphinx search server, plus a backup mechanism.
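
    For the curious, the managed database piece looks roughly like this through Amazon's Python SDK. Everything here (names, sizes, credentials) is a placeholder rather than our actual setup, but the MultiAZ flag is the bit that gives the mirrored copy and automatic failover:

```python
# Create a managed MySQL instance with a standby mirror in another
# availability zone; Amazon handles the replication and the failover.
# All identifiers, sizes and credentials below are placeholders.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # assumed region

rds.create_db_instance(
    DBInstanceIdentifier="forum-db",   # hypothetical name
    Engine="mysql",
    DBInstanceClass="db.m1.large",     # placeholder size
    AllocatedStorage=100,              # GiB, placeholder
    MasterUsername="admin",
    MasterUserPassword="change-me",    # never hard-code credentials in real use
    MultiAZ=True,                      # standby copy + automatic switch-over
)
```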

    Anyway, the reason I posted was to say that my experiment with a smaller, cheaper Amazon database instance this morning just failed and the site went down. Disappointing, as it looked promising and was happy with the load until 1300.

    It's back to the big expensive one that ran perfectly for the last few days. One of those 'test and adjust' moments that I can only realistically do on the live site unfortunately. Sorry!
  14. Good CO

    Good CO LE Admin

    You're right - that's one element of the system that deals with the web servers and so far I'm impressed.
  15. Good CO

    Good CO LE Admin

    No idea - sorry. Our bandwidth requirements are not big at all, so it's no great issue for us. Most images and static files are served from a CDN, so we are just sending HTML. The number crunching involved is very large, but the data output is small. I think we serve about 500GB a month, off the top of my head.
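
    As a rough sanity check on that figure, 500GB a month works out to a very modest average rate:

```python
# Back-of-envelope: 500 GB/month expressed as an average transfer rate.
GB_PER_MONTH = 500
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.59 million seconds

avg_mbit_per_s = GB_PER_MONTH * 8 * 1000 / SECONDS_PER_MONTH
print(f"~{avg_mbit_per_s:.1f} Mbit/s on average")  # roughly 1.5 Mbit/s
```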