The Yacapaca outage: what happened

We have just recovered from our worst-ever outage, at 25 hours and 9 minutes. Here is what happened.

Our principal database server went down at 09:16 Monday 22/09/14. We moved quickly to reboot it, but this failed. Concluding it was a hardware issue (it turned out to be a failed motherboard), we switched over to the backup and got Yacapaca working again around 11:30.

By 12:00 it was apparent that the backup servers were just not up to the job. Yacapaca has grown in complexity several-fold since we configured the server cluster, and we had failed to review it in the meantime. Rather than leave teachers to start sessions that might be spoiled by delays and dropouts, I took the decision that shutting down for the day was the lesser of two evils.

Meanwhile, we had already turned to our hosting partner RackSRV to help us out. They were absolutely incredible! They built, installed and configured a one-off monster server, and had it handed off to us by 17:30.

The guys in our tech team worked shifts through the night to reinstall and re-index the database. By 07:00 we had it working. In fact, the only thing that went wrong was that the site had been allowed to go public early, and a couple of teachers had already signed on. They had to be disappointed, though, because I insisted on another three hours of testing before I would allow it to be launched.

Finally, at 10:25 on Tuesday, Yacapaca was back online.

Right now our focus is on monitoring and bedding down the new server, but rest assured that there will be a thorough post-mortem, a detailed action plan and considerable investment in hardware to make sure that you do not suffer this kind of disruption again.

I would like to extend my apologies for the disruption all this caused, and express my sincere appreciation for the patience that all of our users have shown.

[Update 26/9/14] It turns out our trials were not over, because yesterday we were hit with a massive and sustained DDoS attack between 09:30 and 11:00. There is absolutely nothing a small organisation like ours can do to defend against those kinds of attacks; it’s like trying to secure your house against a main battle tank. We had no choice but to shut down and wait for it to end.

[Update 1/10/14] Finally, we seem to be getting back to a good level of performance. It has taken several all-nighters by the technical team here to balance and optimise our server cluster. I’m not calling the all-clear until we have had 48 hours of fast performance, though. Watch this space.
