In the last three weeks we have experienced two major public IT failures, and I am struggling to understand how they could both become so massive.
I entered the IT industry nearly 30 years ago. Like so many in those days, I came from a previous career and fell into the profession. Falling into it as we did, we brought experiences with us from our previous careers. One thing we all acknowledged was the fallibility of technology, or of anything mechanical. We made sure we reduced risk wherever we could, and where we couldn’t, we had a fallback process. So what is happening in this world today? (Oh ‘eck, I’m sounding like me Dad!)
This weekend, we saw British Airways (BA) grounding all its flights because of a catastrophic computer failure. It is reported that this was caused by a massive power surge at their data centre. We saw images of airports crammed with disappointed passengers and what looked like confused employees. As I sat watching the news, my only thought was: how could a power surge, however massive, cause such damage?
Thirty years ago, when I was involved in setting up my first data centre for a 1,500-bed hospital, we knew that a power surge through our delicate computing equipment could cause a failure we would find difficult to recover from. So we installed a clean supply and a bank of Uninterruptible Power Supplies to smooth that supply and keep us running long enough for generators to power up in the event of a loss of power. I can only assume that BA didn’t do this in their data centre. One that controls thousands of passenger journeys every day! We also anticipated that a failure could result in the loss of important patient data, so we had a ‘hot failover’ from which we could pick up normal IT services, albeit at a reduced capacity, but a service nevertheless. And that ‘hot failover’ was housed away from the main computer facility. Finally, we had manual fallback procedures which we had practised. Those procedures included staff assigned the responsibility of communicating with system users and hospital users. I didn’t see any of this in the news reports.
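To put some very rough numbers on why we went to the trouble of a separate hot failover site, here is a back-of-the-envelope sketch. The availability figures are invented purely for illustration; they are not the hospital’s and certainly not BA’s.

```python
# Rough illustration of why an independent hot failover site matters.
# All figures are invented for the example, not taken from any real system.

primary_availability = 0.999      # assume the primary site is up 99.9% of the time
failover_availability = 0.995     # assume the standby site is slightly less available

# If the two sites fail independently, service is lost only when BOTH are down.
p_both_down = (1 - primary_availability) * (1 - failover_availability)

hours_per_year = 24 * 365
print(f"Primary alone: ~{(1 - primary_availability) * hours_per_year:.1f} hours down per year")
print(f"With failover: ~{p_both_down * hours_per_year:.2f} hours down per year")

# The caveat: this only holds if the failover really is independent --
# housed away from the main facility, on its own power, and practised.
```

The arithmetic only works because the second site shares nothing with the first; a standby sitting in the same building on the same supply would have gone down with it.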
The measures we put into place were not cheap at the time and took some justification; I remember weeks of writing business cases. When we later lost power to the computer suite and no users noticed, the investment was justified.
Since then, technology has moved on and many of the things that were expensive then cost pennies now, especially when you put them against the cost of the initial implementation and the cost of losing systems that businesses have come to rely on. Finally, there is the reputational damage, which is not always acknowledged. This is anecdotal, but the view in my local on Saturday was heavily towards avoiding flying with BA for the foreseeable future. Sure, this will be forgotten over time, but what will be the immediate cost?
Have we become complacent? With our always-on society and IT service companies offering 99.999999% uptime, are we forgetting the fallibility of these devices? I would suggest we probably are, and that it is time to re-evaluate.
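It is worth doing the arithmetic on what those ‘nines’ actually buy. The percentages below are generic examples rather than any particular supplier’s SLA:

```python
# Downtime per year implied by a given uptime percentage.
# The figures are generic examples, not any supplier's actual SLA.

minutes_per_year = 365 * 24 * 60

for uptime in (99.9, 99.99, 99.999, 99.999999):
    downtime_minutes = minutes_per_year * (1 - uptime / 100)
    print(f"{uptime}% uptime allows about {downtime_minutes:.2f} minutes of downtime a year")

# Even 'five nines' leaves only around five minutes a year -- a figure that assumes
# the power, the failover and the people all behave. None of it removes the need
# for a fallback plan when they don't.
```

A number on a contract does not keep the aircraft flying; the plan for the day the number is missed does.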
One of the first lessons I learned about managing IT was three letters: C, I, A. No, not the US intelligence agency, but maintaining Confidentiality, Integrity and Availability. Those three letters stand as much today as they did all those years ago; even now, whenever I consider a change to IT or assess a risk, I recite CIA. In fact, they are probably more pertinent. Risks haven’t reduced; they have changed and in some areas increased. I don’t know how much this failure will have cost BA, or how much WannaCry will have cost the NHS. One thing is certain: it will be more than the cost of putting the technology, people and processes in place to reduce and manage the risks!
My final thought this morning was: what will the ICO make of the NHS and BA incidents? Both involved personal data. One involved damage through encryption, the other non-availability at the point of need. Watch this space. There could be even more cost winging its way.