Monday’s server outage – what went wrong?

On Monday evening, Yacapaca was unavailable from 21:01 to 23:40 UK time due to a problem at our data centre. This is the longest unplanned outage we have had for several years, I’m relieved to say, but no less nerve-racking for that. I requested, and got, a full explanation from Jon Roberts, the MD of Rackserv, who manage the servers for us. It is quoted in full below. You may know that big data centres have huge diesel generators and massive banks of batteries designed specifically to make sure that the power never fails. So why didn’t that save us this time? Read Jon’s full explanation, or just go with my précis: it was ‘human error’, which in this case means somebody wired up a plug wrong. Oops.

At approximately 20:58 GMT on March 17th we became aware of an issue affecting our primary POP at Kent Science Park. After a few minutes of diagnosis it became apparent that our diverse London routers could no longer reach our routers at KSP, and as such we immediately dispatched 3 engineers to site to investigate. Our engineers were onsite within 15 minutes and unfortunately the issue quickly became apparent owing to the eerie silence that would normally be filled by thousands of whirring servers – there had been some sort of critical power failure. Within a further 10 or so minutes the facility operator was onsite along with an electrical engineer, and they quickly began investigating the issue and isolating an assumed affected area so that service could, for the most part, be resumed as quickly as possible.

Unfortunately for us, the issue (I will go into the specifics of this a little later) was within the proximity of the rack containing our primary routing equipment, which meant that whilst the vast majority of our server racks were now back online, they were without network capabilities for a relatively extended period of time as diagnosis continued. The facility operator’s electrical engineer found a Live-Neutral short in the rack containing our routing equipment, ATS (automatic transfer switch) and in-line backup UPS (ironically designed to keep network capabilities up in exactly such eventualities as this, when primary power is unavailable!). As such, the power delivery for this rack was moved from the UPS-protected infrastructure to a dirty mains feed for further diagnosis.
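For readers less familiar with the hardware mentioned above: an ATS (automatic transfer switch) monitors the primary power feed and moves the load to a backup source when the primary fails. A minimal, purely illustrative sketch of that switching logic, in Python (the class, names and behaviour here are simplified assumptions, not a model of the actual equipment, which is electromechanical):

```python
# Illustrative sketch of automatic transfer switch (ATS) behaviour.
# Names and behaviour are invented for illustration only.

class TransferSwitch:
    def __init__(self):
        self.active_feed = "primary"

    def monitor(self, primary_ok: bool, backup_ok: bool) -> str:
        """Switch to the backup feed if the primary fails,
        and switch back once the primary is restored."""
        if primary_ok:
            self.active_feed = "primary"
        elif backup_ok:
            self.active_feed = "backup"
        else:
            self.active_feed = "none"  # total power loss
        return self.active_feed

ats = TransferSwitch()
print(ats.monitor(primary_ok=True, backup_ok=True))    # primary
print(ats.monitor(primary_ok=False, backup_ok=True))   # backup
print(ats.monitor(primary_ok=False, backup_ok=False))  # none
```

The irony noted above is that the very device meant to perform this failover appears to have been the source of the short.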

Unfortunately, despite everyone’s best efforts, the Live-Neutral short quickly vanished and all equipment came back online without an issue, leaving us all scratching our heads as to what the cause could be and how we could ensure no recurrence. As a precaution, we decided to take our in-line UPS and ATS out of service, thus removing the additional resilience they *ought* to offer but also removing what is most probably the initial cause, and this rack was re-connected to the clean, UPS-protected infrastructure. For further background and transparency we have included excerpts from the facility operator’s RFO (Reason for Outage) below:

Fault Synopsis

At 20:58 on the 17th March the Sota NOC was alerted to a loss of communications to equipment within F25.

• Sota technicians immediately investigated the issues and identified that the UPS A stack was electrically isolated on both sides due to upstream and downstream breakers tripping, thereby causing a loss of power to some equipment within the F25 datacentre.

• Investigative works commenced and the fault was tracked to F25 Suite A – Aisle 4. All racks affected by the fault were isolated and tests were carried out to further locate the fault; this was tracked to a customer rack where a Live-Neutral short was found to be present somewhere in the rack.

• At 21:45 the rack was isolated and Sota technicians proceeded to commence start-up of all affected racks.

• The majority of the racks experienced power restoration by 21:55. Remaining racks required additional works to overcome rack PDU in-rush current overload conditions.

Further works were undertaken to ascertain the root cause of the fault. Investigations concentrated on the customer rack where the Live-Neutral fault was located. Sections of equipment were isolated and the fault condition was cleared. The team worked to re-create the fault but was unable to re-create the fault condition, with all equipment appearing to work as expected. Because the equipment did not provide a repeat of the fault scenario, our findings cannot be considered conclusive. At this time we believe the fault was due to the customer’s ATS equipment located within the rack. This ATS equipment has been removed for further investigation.

Mitigation

Where low fault current scenarios are experienced, breakers local to the equipment are able to quickly clear the fault; this reduces the impact to the facility and other customers. On this occasion the fault condition resulted in a high fault current, causing low-level protection to be bypassed. Under these conditions the power protection systems are designed to pass the fault to high-current breakers. The UPS will briefly go into Bypass Mode, protecting the modules and infrastructure from significant failure and allowing any faults to be cleared in a timely fashion by the relevant breakers.

Regular inspection by a qualified electrical engineer of the facility should be undertaken. Sota Solutions completed thermal imaging and inspections of the main electrical distribution systems in F25 in the last 4 months; these showed no issues. Effective electrical discrimination across the facility should be in place. The settings of the breakers within the F25 facility should be examined to check discrimination remains optimal for the facility. Sota Solutions will validate the current discrimination.
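The ‘discrimination’ discussed above is the principle that the breaker nearest a fault should trip first, so a fault in one rack is contained to that rack rather than taking out the whole aisle. A toy illustration in Python, with an invented threshold (real discrimination depends on breaker trip curves and timing, not a single current limit):

```python
# Toy model of electrical discrimination between breakers.
# The threshold value is invented purely for illustration.

def tripped_breaker(fault_current_amps: float) -> str:
    """Which breaker clears a fault of the given magnitude?

    Ideally the local (rack-level) breaker trips and the fault is
    contained. A very high fault current exceeds what the local
    breaker can clear in time, so protection passes upstream and a
    much larger area loses power.
    """
    LOCAL_CLEAR_LIMIT = 400.0  # invented rack-breaker clearing limit (amps)
    if fault_current_amps <= LOCAL_CLEAR_LIMIT:
        return "rack breaker (fault contained to one rack)"
    return "upstream breaker (discrimination bypassed; whole aisle isolated)"

print(tripped_breaker(200))   # low fault current: contained locally
print(tripped_breaker(5000))  # high fault current: upstream trip
```

This second case, where a dead short drives the fault current high enough to cascade upstream, is what silenced an entire suite rather than a single rack.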

Moving on, throughout today we will be continuing to assist customers wherever possible to get them back online if they are not already (please do drop us a ticket if you are not back online), and in the coming days we will be looking at what avenues we can pursue to improve any weaknesses we’ve discovered in the last few hours, as well as looking at what resilience may be beneficial for business-critical clients (such as additional services from within our DR POP, backup solutions and/or management solutions). In the interim, I would personally like to take the opportunity to apologise profusely to all those affected by last night’s disruption and offer our assurance that we will pursue every avenue until we are confident the issue has been identified and resolved.

Many Thanks,

Jon Roberts

RackSRV Managing Director
