I know our customers are keen to know what happened last week when we experienced a network outage. A good number of you have contacted us to find out when you can expect the full Service Interruption Notice (SIN) or reason for outage.
Typically, these are turned around quickly. However, while the cause of the outage is straightforward, the impact and results are much less so, and because of this a detailed investigation is underway both internally and at our fibre provider. Because of what happened, and frankly because we take issue with our provider over it, the notice might take longer than normal, or at least be reduced to the bare facts, which are that we had a connectivity failure caused by multiple fibre cuts, the result of a farmer trenching across his fields. He dug a 5-inch-wide, 6ft-deep micro trench across his fields, somewhere in Cheshire, and went right through multiple bundles of ducted, armoured fibre, severing both of ours.
I know you want more than that, and rather than just give you the technical document, which you will get in due course, I thought it may be better to provide a more human version of events. A word of warning – this is a long post.
Background
Our St Asaph facility has two primary fibre connections that exit at different sides of the building. They then follow separate paths, one heading west toward Bangor and Anglesey and the other east toward Manchester. The westerly leg passes through several points of presence (POPs) as far as Bangor, where it turns and follows a separate route back east to Manchester. The easterly leg also calls in at a few POPs before reaching Manchester. Both legs then enter one of the two Iomart (our parent company) data centres in Manchester before joining our national network.
The westerly and easterly fibres form a ring, or loop, with each fibre travelling a separate physical path all the way around, the idea being that if one fibre is broken, traffic can reroute the other way and still make it to Manchester. The service is supposed to be a resilient fibre loop, and in fact that is what the product description of the service we purchased says. It actually says “resilient”.
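To make that idea concrete, here is a deliberately simple sketch of what “resilient” is meant to buy you (the function name and flags below are purely illustrative, not taken from our actual network or routing configuration): connectivity should survive the loss of either leg, and only a simultaneous failure of both should ever take the site offline.

```python
# Illustrative sketch only - not our real routing logic or configuration.
# A resilient ring keeps St Asaph connected to Manchester as long as at
# least one of its two physically diverse legs is intact.

def st_asaph_reaches_manchester(west_leg_up: bool, east_leg_up: bool) -> bool:
    """Connectivity holds if at least one leg of the ring survives."""
    return west_leg_up or east_leg_up

# A single fibre break: traffic reroutes the other way around the ring.
assert st_asaph_reaches_manchester(west_leg_up=False, east_leg_up=True)

# Both legs cut in the same trench - the scenario diverse routing is
# supposed to rule out - and the site is isolated.
assert not st_asaph_reaches_manchester(west_leg_up=False, east_leg_up=False)
```

The whole premise, of course, rests on the two legs never sharing a duct or a trench, which is exactly the assumption that failed here.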
As an aside, some of you may have seen a diagram on an Iomart website that shows the connections to our datacentres. From our St Asaph location it’s shown as having a direct connection to London, and you should ignore that. I mention it because I’ve seen it referred to by customers and it gives the wrong impression. It was no doubt done by somebody in media or marketing who took the theoretical and made it look literal, and after the last outage we had, I took issue with it; however, I’m annoyed to see that it’s still in use. Hopefully, after this incident, it will be corrected. The proper overview diagram to use is this one here.
So, to the issue itself: the fibre cut. It’s not unusual for fibres, cables and pipes to be severed; it happens all the time, but normally you don’t hear about it because the alternate route of whichever utility is affected comes into play. The most you may notice is a brief flicker of the lights or a stutter in your streamed movie, and then maybe you’ll spot the roadworks, or get stuck in them on your way somewhere, as crews race to repair the damage.
This event was unusual, though, as the resilient service we have been paying for over the years suddenly proved not to be resilient at all. Our monitoring picked it up immediately, and our network team identified that it was fibre-related and escalated to our supplier. They, in turn, dispatched crews to locate the damage. Rather than cover the specifics of that here, I’ll direct you to www.hostingukstatus.co.uk, where you can read the full chain of events from beginning to end.
While we left our network team and our provider liaising over the problem, we escalated to senior staff internally and to management contacts at the provider to try to find out why our resilience wasn’t working and what they were going to do about it.
I have to say I was confused. You see, I have been at Hosting UK from the start. I provisioned and racked our very first server in 1997, before we even became a company, and I’ve personally ordered every single circuit, fibre, rack and datacentre component, at every location we have ever had a point of presence in, and I *know* for sure that when we ordered the fibre loop we have between St Asaph and Manchester it was sold to us as a physically diverse and resilient service.
Now, at this point I must go back to early 2015, when planned engineering works were taking place near Junction 19 of the M6 and it became apparent, via the works notifications we received from our provider, that our diversity was not as it should be.
As soon as this lack of diversity and resilience became apparent we took issue with it, and after some heated exchanges with our provider, plans were drawn up and a map was supplied showing where the fibres should have been, along with a new, more diverse route the fibres would be moved to and dates for the works required to implement the change.
As far as we were concerned, that was that. At no time were we given any indication that the works would not be completed, or that they had not been.
Throughout that time and since, we have been paying for a resilient service that, it is now quite apparent, we have not had. As you can imagine, this is now the subject of intense focus.
So, what about redundancy?
Following the outage on 29th March (unrelated, apart from the same fibre provider being involved), a number of areas for improvement were identified. Among them were additional fibre connections to various points and data centres on our corporate network, including one to our facility in St Asaph, intended precisely to ensure that what happened on 31st May/1st June could not happen. In a cruel twist of irony, that connection was ordered from a totally independent, unrelated national carrier but, due to lead times on such things, was not in place at the time of the incident.
Because of other issues identified, plans had also been drawn up to add further DNS resilience by relocating DNS servers, and to do the same for our support desk, status pages and phone systems.
When this recent incident hit us, some of that work had already been done. We expedited what we could of the remaining work during the outage, and contingencies were put in place to ensure that, at the very least, the phone and webchat lines of communication were open and handled by staff at other office locations around the country.
The responses callers received may only have confirmed the issue and pointed them to our status page for updates, but at least we were able to get the message to our affected customers. Meanwhile, other team members manned Twitter and our status pages for the duration of the incident to ensure that information was available, and our exec teams co-ordinated plans so that, should the outage become more prolonged, we could restore service at other locations if need be.
It may not have seemed to our customers that much was happening, but as someone who was present throughout the whole thing, I can assure you that a lot was happening behind the scenes.
Eventually, service was restored to customers at 13:45 on 1st June, after a total outage time of 16 hours and 48 minutes.
Where we are right now
As I stated at the beginning, there is a very detailed investigation taking place into everything relating to how we, like you, became victims in this situation. As I’ve also stated, the wheels are in motion to ensure that a third, totally independent fibre is installed, and everything that can be done to expedite that is being done.
DNS services are in the final stages of their redistribution, and we’re making other improvements to ensure that DNS entries are cached for longer globally.
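As a rough illustration of why longer caching helps (the TTL values below are hypothetical examples, not our actual DNS settings): a record cached by resolvers shortly before an outage keeps answering queries until its TTL expires, so the longer the TTL, the better the chance cached answers outlive the outage.

```python
# Hypothetical numbers for illustration only - not our actual DNS TTLs.
# A resolver that cached a record just before an outage can keep serving
# it until the TTL runs out.

OUTAGE_SECONDS = (16 * 60 + 48) * 60   # the 16 hour 48 minute outage

def cached_answer_outlives_outage(ttl_seconds: int) -> bool:
    """True if a record cached at the start of the outage is still valid at the end."""
    return ttl_seconds >= OUTAGE_SECONDS

print(cached_answer_outlives_outage(3600))    # False: a 1-hour TTL expires mid-outage
print(cached_answer_outlives_outage(86400))   # True: a 24-hour TTL rides it out
```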
Our support system and status pages now live in a totally separate facility and are replicated continuously to ensure we can bring them up elsewhere at a moment’s notice.
Our phone systems are being adapted and updated as part of a wider programme to ensure greater resilience and accessibility.
In short, we are looking at everything from the top down and back up again, and where we see that we can make changes we will.
Some of our customers have asked why we used the same provider to deliver fibre across our national infrastructure, and the reason is that when you are looking for national dark fibre infrastructure, there are not that many suppliers to choose from, and even fewer that can scale to deliver truly national fibre linking 10 UK datacentres. Of course, we could buy fibre from an assortment of providers, but most of those providers are reselling the very infrastructure that we have purchased directly from the source. At a national scale, your choices become more limited.
I’ve also been asked where the geodiversity was, and how one farmer could take down two fibres, and as I’ve mentioned, the answer to that question is being vigorously pursued. It shouldn’t have been possible.
I know there are some among our customers who feel this has all come too late for them, and who have decided to move away or have already gone. I’m sorry about that, I really am. Not just because of the lost business, but because we let you down to the point that you no longer had any faith in us. Believe it or not, that matters, to me, to the team at Hosting UK and to our colleagues within the group. We’re proud of who we are and what we do, so when somebody leaves because they no longer have faith in us it hits home in a very personal way.
If you work in IT you’ll know that it’s not just a job, it’s a vocation. So when the infrastructure and systems you’ve built, spent days and nights working on, lost sleep over, sweated, bled and, yes, sometimes even shed a tear over fail to deliver for the customers you’ve built relationships with and proudly homed on that same infrastructure, it matters a lot.
I’ve been asked how people can have confidence in us and, after two outages so close together, how they can have any confidence that it won’t happen again. Right now, any answer I give sounds hollow, I know, but what I can say is that I’ve been here for over 20 years, and in that time I can count on one hand, with fingers to spare, how many total outages we’ve had. But, as I said at the end of the last incident, that does not mean we ever become complacent. I also said at that time that we would learn from what happened, and we did; we were simply unlucky that some of the things we were putting in place to improve matters had not yet been fully implemented.
I can only promise you that we will be better and that we will do everything humanly possible to restore your faith in us and to prove to you that we are the company that we know we are and can be.
Phil C Parry
Sales & Services Director