What happens when disaster strikes?

March 21, 2018

One of our clients recently had some issues with their server, and while this wasn’t a complete disaster we did have to go through many elements of their disaster recovery procedure to get their systems back to running the way they should. What happened, and what can be learnt from the process? By Matt Kirby.

Background

This particular client’s setup has elements common to many small-to-medium sized businesses:

A recent Dell PowerEdge server, covered by a Dell Pro Support Mission Critical 4 Hour Response warranty (which came in very handy!), performing the following roles:

Active Directory Domain Controller (so manages the network and handles logins and security)
Main data store (including all company data, accounts and payroll data, roaming profiles, user home drives)
Microsoft Exchange (and associated services: Outlook Web Access (webmail) and ActiveSync (for iPhones etc))
Blackberry Enterprise Server Express (BESx)
Microsoft Sharepoint (intranet, used by staff for general notices)

The server is backed up to an external LTO3 tape drive each night, using Symantec Backup Exec. However, only company data (and email) is backed up, as there isn’t enough time each night (or enough space on the tape) to back up the configuration of the server as well. The tapes are changed regularly by on-site staff, and we monitor the backups regularly. There had been a few minor backup issues recently (Exchange wasn’t backing up fully on occasions) but generally the backup procedures were working well.

The event that started it all

Towards the end of one particular working day (around 5.30 PM) the server stopped responding to the network – everyone’s Outlook went into “offline” mode and wouldn’t connect to the Exchange server, and no-one could access any of the networked drives on the server.

Fortunately I was on site already as part of a scheduled visit, so I was in a position to immediately start investigating (rather than our client having to wait until one of us could get there).

Initially I was hoping that this was a simple networking issue – a loose or failed cable somewhere or a crashed switch. However when I got to the server I discovered that it also wasn’t responding to the locally attached keyboard and mouse. I also noticed that the hard-disk activity light was constantly flickering – which was unexpected as no-one could access the server, and the server appeared to be doing nothing – so there should be little to no hard-disk activity.

At this point we had no option but to forcibly shut the server down. This isn’t ideal as servers (and PCs in general) do not take too kindly to having the power unceremoniously removed, however there was no other way of initiating an ordered shut-down as the server was completely unresponsive and showed no signs that it would recover itself.

Pulling the plug on a running server is always a heart-stopping moment – however this was doubly worrying as there appeared to be hard-disk activity – and anything the server was attempting to save could potentially be lost. However, I knew that backups were in a fairly good state so at worst we’d be looking at a rebuild of the server – not ideal, but not the worst.

So, power comes out, goes back in, and the server is asked to start…….

…..and after a while it does!

After its eventual start I can log in, there’s nothing worrying in the event log, and the server seems to be happy. It’s back on the network, able to serve files to people, and email is flowing. However, the hard-disk activity light is still flashing away quite merrily, so I suspect that something else is going on: there should be some hard-disk activity, but not this much.

Going on a RAID

As I suspected that there was something going on with the disks, I loaded up Dell’s Server Administrator utility to have a look at the status of the hard-disk RAID array. This showed that one of the disks had failed so the RAID array was “degraded” – meaning it was still working but definitely not happy. As this server had a 4 hour response warranty I got on the phone to Dell (by this point it was close to 7 PM).

After going through some initial diagnostics the Dell tech agreed that one disk needed replacing. He requested a copy of the RAID logs, and after looking at them he noticed an issue with “uncorrected errors” on another disk. This could be an indication that this disk was about to fail as well, or it could mean that the RAID array was broken and had a “punctured stripe”. He arranged for 2 disks to be sent out that night (they arrived on-site around 11 PM) and he advised me to replace both disks and restore from backup. He also recommended restoring from a backup prior to the first “uncorrected error” – which was several days previously. He said that more recent backups could contain corrupted data, and restoring this corrupted data could cause the RAID to become unstable again.

Restoring from backup was a possibility, however if we followed Dell’s advice this would mean our client would have lost a couple of days’ worth of work and emails – which is completely unacceptable if this can be avoided. It would also mean the start of a lengthy recovery period before our client was fully functional.

What is RAID?

RAID stands for Redundant Array of Independent Disks (this was originally Redundant Array of Inexpensive Disks – but modern RAID disks are not inexpensive!). RAID is a way of combining several disks to appear as one single disk. There are many ways of configuring RAID, the most common being RAID 5, which offers a balance between speed, fault-tolerance, and cost. This requires at least three disks, and it spreads data across the disks and generates “parity” information – which can then be used to recreate any damaged or missing data. With a RAID 5 array if a single disk fails the server can continue running, replacing the missing data by using the parity data. However if two disks fail all data is lost.
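The parity idea can be shown in a few lines of Python. This is only a minimal sketch of the XOR arithmetic, not how a real RAID controller is implemented: the block contents and the helper name `xor_blocks` are made up for illustration, and a real RAID 5 array also rotates the parity block across the disks from stripe to stripe.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR several equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks, each held on a different disk (made-up contents).
data = [b"ACCOUNTS", b"PAYROLL!", b"PROFILES"]

# The parity block for this stripe lives on a fourth disk.
parity = xor_blocks(data)

# Simulate losing the second disk: XOR the surviving blocks with the parity
# block to recreate the missing data.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # b'PAYROLL!'
```

If two disks are lost, the single parity block is no longer enough to work out both missing blocks, which is why a second failure on RAID 5 loses the array.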

As the server was reporting that everything was fine apart from one dead disk and two instances of “uncorrected errors” we formulated another plan. We would replace the dead disk, let the RAID array rebuild, and then once it was happy swap out the other disk (in case this was close to failing) and initiate another RAID rebuild.

The repair

So, we put in the replacement disk and left the RAID array to rebuild. A few days later, once the RAID rebuild had completed (with an additional consistency check for good measure) we swapped out the other disk, and initiated another RAID rebuild.

During this time the server was functional, but it was running slowly, as the RAID system was working like crazy to get everything back to normal. Our client was advised that their staff could continue using their PCs, but that things would be slow and it would be best to avoid anything that would put too much strain on the server.

After a few days the server seemed happy – the RAID had been rebuilt with two new disks and everything was working. On checking the RAID logs we couldn’t see any further “uncorrected errors”, although there were some “corrected errors” – which was a slight concern (as we don’t like to see any errors on a RAID system, corrected or not).

After a while our client reported that the server still seemed slow, and on asking them to take a look at the hard-disk activity light they reported that it was constantly flickering. Looking at the RAID logs there were now even more “corrected errors”, which was a worry – it’s great that they are being corrected, but worrying that the RAID system was encountering errors. Time for another call to Dell!

The next stage

After talking to Dell it became clear that the RAID system did indeed have a “punctured stripe”: basically the RAID array configuration had become corrupted, which introduced errors that the RAID controller was having to correct on-the-fly. Dell advised that the only way to recover from this was to delete all data, configure a new RAID array, and restore from backup. Without doing this the server would continue to run slowly, and it would not be able to recover properly from any future errors.

However, this Dell technician said we could restore from any backup, as the errors were internal to the RAID array and the data being presented to Windows wasn’t affected. Additionally, even if one of the original uncorrected errors had corrupted a file, restoring this file would not break the new RAID array. As creating a new RAID array is a destructive act he also agreed to send us four new disks (covered by the “Mission Critical” part of the warranty). This way we could rebuild the server on new disks, and if we hit any issues we could at least put the current set of disks back in and be back to where we started.

You may remember from the start of this blog post that our client was only backing up their data each night, as there wasn’t enough time or space on the tape to back up the server configuration. In this instance a full recovery would require:

Creating a new RAID array
Installing Windows (and required drivers)
Configuring networking
Joining the server to the network (under a different name)
Installing Active Directory
Installing and configuring Exchange
Installing and configuring Sharepoint
Installing and configuring Blackberry Enterprise Express
Installing and configuring Symantec Backup Exec
Restoring data (files, email, Sharepoint intranet, etc)

This would be a very slow process – ultimately our client would get their server back with all their data but there would be several days with no data access at all, and it would be a further number of days if not weeks before everything was back to working how it did. From a “Disaster Recovery” point of view this was acceptable – our client would ultimately get a working system back without any loss of data. However, from a “Business Continuity” point of view this wasn’t great – as there would be extended downtime while the recovery took place.

A unique position

Usually when a server needs a rebuild it’s an emergency, and by definition completely unplanned. It’s a stressful time for all concerned, as everyone is relying on plans that were drawn up ages ago and having to make additional decisions as they go along.

However, our client was in the very unusual position of having a system that needed to be rebuilt, but was currently working (albeit slowly, and possibly not for long). This meant that staff could be forewarned of any down-time and could prepare contingencies. It was also a great opportunity to test the disaster recovery plan fairly safely, as nothing we were about to do would be destructive (thanks to the extra 4 disks supplied under the Dell warranty).

As the server was currently fully functional we decided to take a full backup of the server (rather than the existing data-only backup). As we knew this wouldn’t fit on to a single backup tape we lent our client a 3 TB USB drive to save the backup to. We also knew that the backup wouldn’t complete in the usual overnight backup window, so we scheduled some downtime for the following day, as we didn’t want people changing data while the backup was running.

Having a complete backup of the server would mean that the recovery process would be far quicker – as we would have less configuring to do. With a full backup (data and configuration), this would be the recovery procedure:

Create a new RAID array
Install Windows (and required drivers)
Install Symantec Backup Exec
Perform a “full-system” recovery

Enter IDR

Symantec Backup Exec has an optional module called “Intelligent Disaster Recovery”. This is usually a chargeable extra, however as our client is a registered charity their license gives them access to the full Backup Exec suite, which includes IDR.

We decided to take this unique opportunity to test IDR in a live setting. IDR takes a little bit of preparation, but the main benefit is a greatly simplified recovery and reduced recovery time.

To perform a recovery using IDR you need:

A custom IDR CD for the server being recovered (this is created by Backup Exec, and needs access to the Windows installation disk)
A Backup Exec Disaster Recovery file (.DR), preferably created after the most recent backup
A complete system backup

Once you have those pre-requisites the actual recovery would go something like this:

Create a new RAID array
Boot from IDR CD, select what you want to restore, and leave it for hours
When IDR is complete, reboot server, and restore Exchange database and any SQL databases (e.g. Sharepoint data)

The recovery: redux

We estimated that the whole process of running a full backup and doing a full recovery would require at least one day of downtime, possibly two, if everything went well. We did consider doing this over a weekend, but decided against it for two reasons, both relating to what would happen if things didn’t go to plan:

If we encountered problems we would potentially require additional support from Microsoft and/or Symantec – and this would be easier to obtain (and cheaper) during the working-week
Doing a recovery over a weekend would give us the two-day window that we estimated would be needed – but without any contingency. Any problems would push the recovery into the start of the working week, without an easy way of forewarning staff

We arranged with our client for them to not use their server on a Thursday and possibly Friday, which would give us the weekend as a contingency. This did mean that staff would definitely have downtime during the recovery, but as there was advance warning they would be able to make arrangements ahead of time. Had we done this over the weekend there was a chance of no downtime during the working week, but if things went badly there would be down-time and unprepared staff, which would arguably be more disruptive than planned down-time.
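The reasoning about contingency can be put into numbers. The sketch below is illustrative only: the dates are an arbitrary example week rather than the real dates, the 09:00 start and Monday 09:00 deadline are assumptions, and `slack_hours` is a hypothetical helper. The 48 hours is our two-day estimate, and the 55.5 hours is the sum of the stage times we actually recorded.

```python
from datetime import datetime, timedelta

def slack_hours(start, duration_hours, deadline):
    """Hours left between finishing the work and the deadline (negative = overrun)."""
    finish = start + timedelta(hours=duration_hours)
    return (deadline - finish).total_seconds() / 3600

# Arbitrary example week (placeholder dates, assumed 09:00 starts).
thursday = datetime(2024, 1, 4, 9, 0)
saturday = datetime(2024, 1, 6, 9, 0)
monday = datetime(2024, 1, 8, 9, 0)   # staff need the server back

for label, hours in [("estimated", 48.0), ("actual", 55.5)]:
    print(label,
          "Thursday start:", slack_hours(thursday, hours, monday),
          "Saturday start:", slack_hours(saturday, hours, monday))
```

Under these assumptions a weekend start has no slack at all on the two-day estimate, and the time the work actually took would have run several hours into Monday morning, whereas the Thursday start absorbs the overrun.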

In the end we did need to use the contingency of working into the weekend – mainly due to the backup taking far longer than we had estimated. However, once the backup completed the recovery took far less time than we estimated – we usually expect recovery to take longer than the backup does, however with IDR this was greatly reduced.

Once the IDR portion of the recovery completed we moved on to restoring Exchange and any SQL databases. The restoration of Exchange worked beautifully, but when we came to restoring the SQL database used by Sharepoint we hit a problem. The restore wouldn’t work as the database wasn’t running, but the database wouldn’t run as it couldn’t find its files, which were the very files that needed to be restored. This could have required a re-install of Sharepoint, but as we had the old disks we were able to restart the server on those, copy the required files over to another machine, and reboot the server on the new disks to copy the files back. The database was then able to start without issue. This issue would not have occurred if there had been at least one full-system backup before the one we were restoring. When Backup Exec takes a backup of an SQL database it saves a copy of the file next to the current database, which is then included in later full backups. When doing the restore you will then have an un-attached (old) copy of the database on the disk, which you can use to start SQL, and then restore the most recent version of the data.

Although the whole process did take longer than anticipated, by Saturday evening our client had a server that was completely back to its old self.

Here’s how long each stage took:

Complete backup of server to a USB 2 disk (around 1TB of data and “system state”): 40 hours
Creation and initialisation of new 1TB RAID array: 90 minutes
IDR recovery: approx 8 hours (this is estimated, it ran overnight and was complete when we checked in the morning)
Recovery of Exchange and SQL databases after IDR completed: 6 hours
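To put those figures in context, here is a rough bit of arithmetic in Python. It is a sketch under stated assumptions: “around 1TB” is taken as exactly 10^12 bytes, the stages are treated as running back to back, and the 480 Mbit/s figure is the theoretical USB 2.0 signalling rate rather than anything we measured.

```python
# Stage durations in hours, taken from the list above.
stages = {
    "full backup to USB 2 disk": 40.0,
    "create and initialise RAID array": 1.5,
    "IDR recovery (estimated)": 8.0,
    "restore Exchange and SQL databases": 6.0,
}
total_hours = sum(stages.values())

# Assumption: "around 1TB" treated as exactly 10**12 bytes.
DATA_BYTES = 1e12
backup_seconds = stages["full backup to USB 2 disk"] * 3600
throughput_mb_s = DATA_BYTES / backup_seconds / 1e6

# Theoretical USB 2.0 signalling rate: 480 Mbit/s, i.e. 60 MB/s.
USB2_MB_S = 480 / 8

print(f"Total of all stages: {total_hours:.1f} hours")                # 55.5 hours
print(f"Average backup throughput: {throughput_mb_s:.1f} MB/s")       # 6.9 MB/s
print(f"Share of USB 2 theoretical rate: {throughput_mb_s / USB2_MB_S:.0%}")  # 12%
```

The backup averaged roughly 7 MB/s, a small fraction of what the USB 2 link could carry in theory, so the bottleneck was more likely the degraded RAID array or the backup process itself than the external drive connection.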

What can we learn?

Firstly, a Dell warranty on a server is worth its weight in gold! Dell delivered replacement disks the same evening when we discovered the initial failed disk. Dell were also invaluable in advising us on our options. Without the warranty the server would have been down until replacement disks could be sourced, purchased and delivered. We also wouldn’t have had the luxury of being able to retain a working set of disks as we rebuilt the server on to a new set.

In going through this process we were also able to record actual times required for certain processes. We knew that there wasn’t enough time each night to back up the server data and configuration, but we didn’t know exactly how long it would take. Although in this instance doing a full backup meant extended downtime, we can now use this knowledge to make improvements to the existing Disaster Recovery Plan, and also to improve Business Continuity options.

Going forward, we now know:

Having the option of using IDR vastly reduces the recovery time – but does need a recent full backup in order to work. We already knew that the data-only backup only just completes overnight, and a full backup of this server takes around 40 hours. We need to improve the existing backup system to ensure that regular full backups are taken. We also now know that the first IDR backup doesn’t quite give full recovery – as some SQL databases are unable to start without some data to work with
We need to improve Business Continuity options, so if there is an issue staff are able to continue working (after a fashion) without needing to access the server
This sequence of events also highlights the risk of having one server cover multiple roles – when this server went down there was no access to email or files until the server was fixed. Having a separate server for files and email reduces this risk
Many of our clients have a Disaster Recovery plan, but few factor in Business Continuity – being ultimately able to restore all data is great, but what does the business do in the meantime while waiting for this recovery to happen?

In this instance our client ultimately ended up with a fully working server and no loss of data. However, there was loss of staff time while the recovery took place. We are using this event to go through existing procedures with our client to see what can be improved – it was unfortunate that this happened, but we should take this opportunity to learn from the process to make future unplanned events easier to manage, with less risk to the business.
