Router fails, no packets dropped.

January 29th, 2014 by

This morning one of our routers in our Cambridge data centre stopped reporting bandwidth data to our billing system. We investigated and whilst it was still routing packets without issue, it appeared to be experiencing hardware failure.

We’ve powered the router down, pending full investigation on our data centre visit this afternoon. Currently all traffic from our Cambridge site is being handled by our other router. This seamlessly failed over with no customer impact.

Depending on your choice of terminology ‘Redundancy has been reduced to N’, or ‘The network is at-risk’. In Mythic Beasts we like to speak English so this translates to, if something else fails before the router is restored to service, there is a risk of a network outage to our Cambridge data centre.

Update : Friday 31st we fully restored our network to it’s usual redundant configuration by replacing the router with a similarly over specified replacement. Customers may have received free bandwidth for some of this period.

Saturday outage report

January 27th, 2014 by

Edit: we’ve now received a report from Telecity, so have updated this report to take account of this.

Further edit: explanation for extended outage in one rack added.

Summary

  • A power interruption occurred at around 8:09am on Saturday 25th January, affecting multiple floors in Sovereign House.
  • For the most part, the interruption was momentary (around 500ms), but long enough to cause a reboot of affected equipment.
  • One of our racks was without power until 10:38am, due to a tripped circuit breaker.
  • Our staff were onsite at 11:15am, and then worked to restore services that had not come back up cleanly. One such server was our SOV DHCP server which will have affected any virtual servers configured to boot via DHCP.

Details

The power outage was caused by an interruption to the external mains power supply, followed by a failure of the DRUPS (Diesel Rotary Uninterruptible Power Supply) system that is supposed to ensure that power to the data centre is maintained during such a power cut.

The DRUPS system contains three separate units with sufficient capacity to cope with the failure of any one unit. Unfortunately, in this event, the unit that failed did so in a manner that triggered a shutdown of the other two. From the Telecity report:

… one of the units on DRUPS System 1 experienced a fault on its synchronisation card. This fault caused the unit to go into overload which, in turn, had a direct impact on the remaining two units. During the overload condition, the faulty unit back-fed the other two units which, for protection and per design, automatically shut down.

At this point the system went into raw mains bypass mode (i.e. bypassing the UPS systems, and connecting the data centre load directly to the mains). This occurred around 2 minutes after the original mains supply failure, by which point the mains supply had been restored, but there was a 500ms interruption as the bypass occurred.

This much is consistent with our observations, which is that in all but one rack, the logs on our remote PDUs did not record an outage, but the vast majority of equipment attached to them did: the management interfaces in these PDUs draw very little electricity and are known to be able to survive very short power supply interruptions.

As noted above, one of our racks experienced a more extended outage. This was due to the circuit breaker on the power bar being tripped. This was noticed and rectified by data centre staff inspecting racks following the initial outage.

At this point, the faulty DRUPS unit is out of service, meaning that whilst the power supply is protected, there is no redundancy until the unit is repaired and tested.

Conclusion

Whilst we are certainly unhappy about the outage, at this point we have no cause to question our choice of data centre provider. Sovereign House is a major UK internet hub, and is a purpose-built 6 floor data centre, built to the highest industry standards. With the best will in the world, there will always be faults that can take an entire DC, or significant parts of it, off-line, and for this reason, we would always recommend that mission-critical applications are served from multiple sites. Independent routing ensured that our facilities at other sites were unaffected by the Sovereign House outage.

That said, the aftermath of the outage has revealed some areas in which we can improve. In particular, the extended outage of one rack had a knock on effect to connectivity of others. Following Sunday’s scheduled maintenance work, we’re now in a position to improve our network topology to make it more resilient. We are also planning improvements to our Virtual Server hosts and database servers to ensure that they can recover more quickly following such an outage, and we have already made changes to our support systems to make them more resilient.

Beyond directly fixing the affected units, Telecity are also planning improvements to their communications during such an incident. This will help us direct our efforts more effectively.

Notes

For the avoidance of doubt, this interruption was completely unrelated to the network upgrades scheduled for Sunday evening, which went ahead as planned.

Finally, thank you to all customers who monitored our status page during the outage.

More bits

January 10th, 2014 by

At the end of last year we took the decision to significantly upgrade our two connections to LINX – our busiest connections to the outside world.

This turned out to be a good plan as Mythic Beasts got a Christmas present in the form of a new company bandwidth record, thanks to two customers, Blinkbox Music and Raspberry Pi getting a substantial spike in hits as people unwrapped their Christmas presents.

And it seems that the excitement of all the presents hasn’t worn off, as the Christmas day record has just been toppled by a new all time high yesterday. With the Blinkbox apps very high in the free music app charts, we’re not expecting it to stand for long.

Raspi.tv

January 9th, 2014 by

Here’s an unsolicited customer review of a migration of a dedicated server to one of our managed virtual machines from Alex at raspi.tv who’s building a 9inch HDMI 1080p screen.

New Year, New Server At mythic Beasts

You can find the original twitter conversation at @Mythic_Beasts.

Coping with Christmas

January 7th, 2014 by

Our latest blog post is on the Raspberry Pi website. Coping with Christmas