A Testing Week – of Technical Capability, Disaster Recovery and Sleep Deprivation…

The past 7-10 days have been an interesting journey, to say the least. There have been many events, both personal and professional, that have completely taken over my normal routine and resulted in some insanely long days, a serious lack of sleep and a backlog of things to do.

But what I really wanted to blog about was the work side of things…

A typical week for me involves lots of random events, since despite my best efforts, and those of my team, we don’t really get the final say in much of what happens around us – customers can have IT issues needing our attention, and systems can break unexpectedly and without any real warning.

What is unusual is to have several different issues collide, leaving us with a huge amount to do and really testing the skills of our team, as well as all those backups, disaster recovery plans and such that everyone talks about but very few actually have in a meaningful way…

Monday started with a bit of a bang, when our core e-mail service – very much a flagship, highly advanced service – decided it had finished providing e-mail and was going to take a nap. Actually, that is somewhat understating it. In reality, we had a complete loss of service for a portion of our customers, as the database storing their e-mail, contacts, calendar entries and such had become corrupted.

This wasn’t something that was immediately apparent, since the errors being thrown originally related to a lack of backups (or rather, that the system could not work out whether a backup had been taken recently). With a little digging we eventually determined there was a database issue.

Troubleshooting these kinds of things is far from trivial – you are dealing with live customer data, on an operational platform with hundreds of people connecting and literally hundreds of GBs of data, so everything you do can have a large impact on the reliability of the services being offered. The key concern is the integrity of customer data – after all, they trust you to keep it safe, handle the backups and so on.

I can tell you that is a pretty large responsibility and one I don’t recommend taking on!

Fortunately I’m incredibly paranoid about backups and having plans to handle this sort of thing, and so came the challenge of how to restore service to our customers. While simply bringing customers straight back online would be the ideal scenario, you have to take a number of precautionary steps when handling this type of data, so for one business day a subset of our customers (generally not a whole customer – with a couple of exceptions, for legacy reasons we need to address) had no service.

Getting the system to a safe state to begin the task of restoring backups and rebuilding mail data was particularly challenging, because the very systems that are supposed to provide redundancy for other types of problem (for example, loss of a server through fire, theft or simple failure) haven’t been operating as flawlessly as they ought to – we have a number of ongoing projects to address this.

This made life even more tricky, since we then had to interrupt service for all our other customers 2-3 times throughout the day in order to progress this particular issue. Normally this can be avoided, but the particular condition the fault had arisen in meant the only real fix was to dump everyone off and force a clean start on the system. Having to do this in the middle of the day is not something I like doing, but experience tells you that restoring data is a long-winded, time-consuming process – so waiting another 5-6 hours would have made a big difference to how quickly we could bring everyone back online.

With those challenges addressed, we could finally move on to restoring backups – again a problem scenario, since we have also been working through some issues with our Backup Platform (indeed, the specialist teams at Microsoft have been working with us on them). So the nearest local backup server did not have the data we wanted.

For most people this would be the point of panic, since a backup server, well, just isn’t any use unless it has backups. Fortunately, as I mentioned before, I am incredibly paranoid about backups, so we have more than one backup system in place – and we turned to our off-site backups. As expected, these had the data we wanted, and we could refresh the main backup servers to ensure we could perform a restore.

Sounds simple enough, right? Well, not quite – there is another issue. We know that restoring customers can take many hours (don’t forget we are talking about bringing back hundreds of GB of data, which in reality means we have to build a “good” database first, then reimport customer mail back into the real mailboxes they have). This would mean leaving customers with no e-mail for hours, possibly days.

Accordingly, we needed to get customers operational again so they could at least receive new mail until we could bring back the historical mail. To do this, we effectively add the affected customers to alternative databases – they end up with a blank mail account, but anything not yet delivered will turn up and allow them at least some form of communication.
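To give a rough idea of what “adding customers to alternative databases” looks like in principle (the pattern is sometimes called a “dial tone” recovery), here is a minimal Python sketch – the function, user and database names are entirely made up for illustration, not anything from our actual platform:

```python
def assign_to_spare_databases(affected_users: list[str],
                              spare_databases: list[str]) -> dict[str, str]:
    """Spread the affected users across spare databases so new mail can
    flow into fresh, empty mailboxes while the historical restore runs."""
    if not spare_databases:
        raise ValueError("No spare databases available")
    return {user: spare_databases[i % len(spare_databases)]
            for i, user in enumerate(affected_users)}


# Purely illustrative example: three affected users, two spare databases
print(assign_to_spare_databases(["alice", "bob", "carol"],
                                ["DB-Spare-1", "DB-Spare-2"]))
```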

So, problem solved? No – you also have to make sure you can bring back messages that were received before the failure (e.g. between the last backup and the failure – those messages are “at risk”). Again, more paranoia on my part means we have configured the systems that process new incoming/outgoing messages to hold a reasonably large “cache” of everything they have processed recently.
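For the curious, working out which messages need replaying is conceptually very simple – here is a hedged Python sketch, assuming each cached message carries a 'delivered_at' timestamp (my representation, not the real system’s):

```python
from datetime import datetime


def messages_to_replay(transport_cache: list[dict],
                       last_good_backup: datetime,
                       failure_time: datetime) -> list[dict]:
    """Pick out the 'at risk' messages: delivered after the last good
    backup but before the failure, so the transport cache holds the
    only surviving copy."""
    return [msg for msg in transport_cache
            if last_good_backup < msg["delivered_at"] <= failure_time]
```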

A few requests later, and we now have customers online, receiving new messages as they arrive, plus the messages received between the last backup and the point of failure. Not bad, but we still have the millions of messages they have received historically – in some cases our customers could have many years of historical e-mail data. It is pretty critical to some of them that we can get this back, since we deal with all manner of businesses, including solicitors and such who must BY LAW retain content – so the challenge of ensuring integrity really matters.

Our next step, then, is to bring back everything we have in the backups. This takes some time, as I have said before – we offer pretty generous storage allowances, so an individual customer may have anywhere up to 25GB of e-mail data in a single user’s mailbox, and the volumes of data are not to be taken lightly.
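To put some (entirely made-up) numbers on that, a quick back-of-the-envelope calculation shows why a restore is measured in hours rather than minutes – the figures below are illustrative assumptions, not our real throughput or mailbox sizes:

```python
# All numbers are illustrative assumptions, not our actual figures
mailbox_count = 200           # affected mailboxes (assumed)
avg_mailbox_gb = 5            # average mailbox size in GB (assumed; the cap is 25GB)
restore_mb_per_sec = 50       # sustained restore throughput (assumed)

total_gb = mailbox_count * avg_mailbox_gb
hours = (total_gb * 1024) / restore_mb_per_sec / 3600
print(f"{total_gb} GB at {restore_mb_per_sec} MB/s is roughly {hours:.1f} hours")  # ~5.7 hours
```

And that is just moving the raw data – the per-mailbox reimport and tidy-up described below add more time on top.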

We undertake this in three steps. First, recover a full copy of the corrupted database to the nearest point we can to the failure (as we take several backups a day, we can be reasonably close) – in this case the nearest usable backup was from around 12 hours before (the interim backup had failed, which we suspect was an early warning of the corruption!).
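Step one boils down to “find the most recent backup that actually succeeded before the failure” – something like this hypothetical Python sketch (the Backup class and its fields are my own shorthand, not our backup software’s):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Backup:
    taken_at: datetime   # when the backup completed
    succeeded: bool      # whether the backup job reported success


def nearest_restore_point(backups: list[Backup], failure_time: datetime) -> Backup:
    """Pick the most recent successful backup taken before the failure.

    Failed backups (like the interim one in this incident) are skipped,
    which is how we ended up restoring from roughly 12 hours earlier."""
    candidates = [b for b in backups if b.succeeded and b.taken_at <= failure_time]
    if not candidates:
        raise RuntimeError("no usable local backup - fall back to the off-site copies")
    return max(candidates, key=lambda b: b.taken_at)
```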

Next, we have to import that mail back into each customer’s mailbox – one user at a time, as it were (actually it is a little slicker than this, but that is essentially what happens behind the scenes).

The final step is to fix the duplicate issue. Since we have “replayed” mail as far back as we could before we restored anything, we have to complete a process which “corrects” the state of e-mail – some messages may need moving to a different folder, may have been deleted by the user and need to be re-deleted, or may have been edited/updated in some way. The mailboxes also hold calendar data, so we need to make sure we do not duplicate customer appointments, AND we also have to ensure that where a customer has added a new appointment in the meantime, and it conflicts with an item that was in the last backup, we handle it correctly.
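Conceptually, this “correction” pass is a merge that always lets the live mailbox win. Here is a simplified Python sketch, assuming every message and appointment can be keyed by a stable item ID – a big simplification of what actually happens behind the scenes, but it captures the idea:

```python
def reconcile_restored_items(live_items: dict[str, dict],
                             restored_items: dict[str, dict],
                             deleted_ids: set[str]) -> dict[str, dict]:
    """Merge items recovered from backup into the live mailbox state.

    - Items the user has deleted since the backup stay deleted.
    - Items already present live (replayed mail, or things the user has
      since moved or edited) win over the backup copy, so nothing is
      duplicated and no user change is undone - this covers calendar
      appointments as well as messages.
    - Everything else from the backup comes back exactly once.
    """
    merged = dict(live_items)                # live state always takes priority
    for item_id, item in restored_items.items():
        if item_id in deleted_ids or item_id in merged:
            continue                         # respect deletions and newer copies
        merged[item_id] = item               # bring back the historical item
    return merged
```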

If you are still reading, you are probably struggling to understand how we even keep track of all this – the answer is actually in having a plan for disasters. I would be lying if I said you can document every possible issue, request or command that may need to be entered; the reality is that your plans give you the basic outline and make sure you complete all the key steps – but you need capable staff with expertise if you are going to survive.

The above gives you an idea of just ONE of the many incidents this week. Combine it with server hardware failures, more problems with the main Backup System that I touched on above (which actually resulted in us needing to rebuild it and then, perhaps ironically, restore our backups from a backup…), and, to top things off, an incident with one of our Datacentre partners which resulted in us losing service at one facility for about 2 hours – the cause? A break-in and theft of some pretty critical equipment.

I won’t even bother to explain why that should never have happened, but it did!

The end result of this is that while customers complain that e-mail is down or that they have some technical issue (which I understand – it affects them THERE and THEN), what they do not see is how much effort goes into making sure that when something goes wrong (and it will – anyone who tells you otherwise is lying, and you will get burned), good plans and processes mean you can get them back up and running.

It can take a little time, depending on the issue – a combination of the time to identify the problem (the first major barrier, and one people forget when yelling for answers on “when will it be working again”) and the time to execute the appropriate plans to get things up and running. But the reality is that an outage of this nature is very much the exception, and multiple different failures are even less likely; when it does eventually happen, a decent IT partner that has genuinely ticked all the boxes to make recovery possible is worth far more than you pay them.

It is this hidden aspect of IT that is the hardest to explain, and the most difficult to get people to “buy into” – realising the difference between a quality provider and a “does the basics” operator. Everyone tells you they have a plan, but ultimately you will only ever find out whether they are right once the problem has already happened.

A few days on, and I’m actually pretty pleased with how we handled things overall:

– We contacted those customers directly affected by the core issue – by phone or in person – to make sure they knew we were aware of and dealing with the problem. Not many companies bother with this.

– We regularly updated our Service Status feed to provide information as best we could (not always the easiest thing when you are trying to focus on getting things going), so that we could help customers plan around the issue.

– Our plans for Backups – and importantly Backing Up The Backups, with more Backups of Backups, and then more Backups again – seem crazy to many people. Indeed, we have been told on more than one occasion that our plans are far more extensive than those of providers with considerably more financial clout and resource than us, yet they proved to be worth the considerable investment and effort we put into them.

– Having a team of people whose skills complement each other allowed us to make rapid progress in restoring service.

– Actually having a test platform in our “Labs”, where we can dry-run intended recovery steps to minimise the chance of data loss, was invaluable.

– For the isolated cases where the customer’s client software (normally Outlook) got confused by the changes, we provided full remote assistance at no charge to help bring those customers back online with normal service.

…plus numerous other steps.

Hopefully this gives those who are interested a good idea of just how busy a week can be for me as someone running a small business – and perhaps an insight into just how much happens behind the scenes that you never see when you log onto your computer and begin using things such as e-mail. So the next time something goes wrong, remember that the decision to use a quality provider isn’t about when things have been going smoothly – it really is about what they do when the muck hits the fan.

And it will.

If you think that was all I did, then ignoring the things I did personally (which will remain personal), I still found time to help one of our key customers move offices, go and see a new customer in the middle of the country, and get some other issues fixed. While there was plenty more I should have done but haven’t, I think 20-hour days were pretty much the limit!

One last thing – I’m actually pretty appreciative of our customer base over this incident. We know it has had an impact on them despite our efforts to minimise it, but on the whole they were patient and understanding, and helped us by avoiding constantly calling for updates – whether that is because they trust us to get it sorted, or because they were reassured by our attempts to reach them, I don’t know.

(Having said all this – if you are looking for a quality IT provider that actually cares about ensuring that problems get fixed, and that when something really awful happens you don’t get left in the lurch, give us a shout – www.vpwsys.net!)
