SharePoint Best Practices Posts- Day 6 of 10. Planning for failure in your SharePoint setup.

Guest post by Chris Mckinley- Senior SQL/SharePoint Developer

An occasional day in the Systems Office at Twynham School

The System’s phone rings…

“Hello IT.”

“The system has lost all my files!”

“Where were your files stored?”

“On the Gateway”

“OK, do you know where on the Gateway?”

“Oh! Don’t get technical, I don’t understand all this VLE stuff”

“Where did you last see the files?”

“In the revision resources section on the History gateway”

“OK, one second… I’m just looking now…I can see about 10 files there”

“Yes, it’s the one called ‘important course material 101’”

“I can see the file on my screen here”

“Yes yes, but it’s blank when you open it”

“OK… Click, click, click… Right, can you open it again?”

“Ah it’s back now”

“I just restored a previous version. It looks like you had overwritten it”

“Hmm, yes, well, thanks”

This typical scenario can occur in any organisation. Is it a failure? No, but imagine the scenario if the IT guy says, “Nope, sorry, you overwrote it. It’s gone.” That’s not such a good end to the story. For this blog post I wanted to focus on planning for failure when it comes, as it undoubtedly will. I’ve split failure into three types and, in the table below, suggested the likelihood of each event happening over a year.

Type                                             Likelihood
Human failure – files deleted or overwritten     100%
System failure – server downtime                 100%
Natural disaster                                 0.09%*

*I’ve taken these values from Mike Watson’s #spbpuk session.

What is the impact of these failures in a school environment? Working through the list: lost files will always result in frustrated end users and blame on IT, regardless of who made the error. It’s harsh, but frustrated end users are end users who trust the system less and don’t want to use it. System downtime is unavoidable, with power cuts, server maintenance and critical failures leading to networks going down. Natural disasters can’t be predicted, but such events happen, and no matter what, you’ll need to get things back online. If you lost your database server, how long would it take to supply staff members with group lists and timetables? Would you even know who your 1500 students are? The main focus of the session I attended at the SharePoint Best Practices Conference was preparing for failure and planning how you can manage the process of getting your SharePoint setup back up and running.

There are some simple things you can do to ease the frustration of end users, like teaching them about the recycle bin and enabling versioning; that will give you some protection against user failure. You could even purchase third-party tools such as AvePoint. Don’t rely on the out-of-the-box SharePoint backups (they are not very good), or even on SQL backups: these will only restore databases, not items.

To fully prepare for a SharePoint failure we must begin a process of planning and documentation. If you have a plan that says that in a worst-case scenario you will be down for 8 weeks and lose 3 years of data, and someone signs that off, you’re covered. But beyond covering you when failure comes, planning and documentation will help to establish how you will get your SharePoint back up and at what cost.

Work out some statistics showing how long you would be down for and how much you would lose. When you know this you can design your recovery solution. If you want three-nines availability (99.9% uptime, which is under 9 hours of downtime per year) then you’ll need fault tolerance on your servers; if you want regular backups, that will cost in storage. How much is your data worth? It is hard to put numbers on data in a school, where there are no customer databases or multimillion-pound contracts but instead John Smith’s Y11 Maths coursework. If he loses that a week before hand-in then that’s not a happy time for anyone.
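
If you want to put numbers on an availability target before you take it to whoever signs off the plan, the arithmetic is simple enough to script. The snippet below is only a rough sketch: the function name and the list of targets are my own, and it assumes a 365-day year with the service expected to be up around the clock.

```python
# Assumption: a non-leap year and a service that should be up 24 hours a day.
HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how quickly the budget shrinks.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

For 99.9% this gives about 8.76 hours per year. A school that only needs the system during term time could swap in a smaller hours figure and get a very different budget.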

Planning for failure is all about preparation and balance. What can you afford to implement and what can you afford to sacrifice? Again, it is time for the hard miles and documentation. Plan what you need, what you can afford, and put it into action. And of course, when things go wrong, you can wave that signed-off document and say… don’t worry, we’re backed up and will be back to normal by the end of the day, as we said we would be.