Home Office Data Loss

Given that the data is still missing and everyone is staying tight-lipped about it, I’m leaning towards an initial fault that was made worse by a problem with the backups. If the backups (and the backup schedule) had been good, they could have restored the data and lost only a day’s worth.

Backup data testing is still low priority for many. This was one of the resilience layers I would mention but it went to the back of the queue and stayed there.
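For anyone wondering what a minimal backup test might look like, here is a rough sketch in Python. It assumes the "backup" is a plain file copy and uses `shutil.copy2` as a stand-in for whatever restore command a real backup product provides; the function names are made up for illustration. The point is only that a backup counts as tested once it has been restored and compared with the original.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, backup: Path) -> bool:
    """Restore `backup` into a scratch directory and compare it with `source`.

    Assumption: the backup is a plain file copy. A real product would have
    its own restore step in place of shutil.copy2.
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / source.name
        shutil.copy2(backup, restored)  # stand-in for the real restore step
        return sha256_of(restored) == sha256_of(source)
```

Run on a schedule against a sample of files, a check like this would flag a backup that exists but does not restore to the same content.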

On the HO incident, six days (and counting) is surely an SLA breach. Perhaps the DR option is OK but for some reason has not yet been invoked. Sometimes the "we are nearly there" line delays invocation, and after a few hours without getting there the phrase is heard yet again, delaying invocation further.

To me this is a mistake. If you are confident you can recover to the agreed RTO/RPO and other KPIs, then do so: assuming all goes to plan, you meet the SLA and are fine. Dither and go over, and you are in trouble.
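The arithmetic behind "invoke or dither" is simple enough to sketch. The function name and the figures below are hypothetical, and the sketch assumes the RTO clock starts when the outage starts and that the DR recovery time is known from testing.

```python
from datetime import datetime, timedelta


def latest_invocation_time(outage_start: datetime,
                           rto: timedelta,
                           dr_recovery_time: timedelta) -> datetime:
    """Latest moment DR can be invoked and still restore service within the RTO."""
    if dr_recovery_time > rto:
        raise ValueError("DR cannot meet the RTO even if invoked immediately")
    return outage_start + rto - dr_recovery_time


# Hypothetical figures: outage at 09:00, 8 hour RTO, DR takes 3 hours to bring up.
deadline = latest_invocation_time(
    datetime(2024, 1, 1, 9, 0), timedelta(hours=8), timedelta(hours=3)
)
print(deadline)
```

With those figures the decision has to be made by 14:00. Every "we are nearly there" after that point is time borrowed against the SLA.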

The time for scenario evaluation, threat analysis, risk management/mitigation and planning is before the event. At the time of disaster it is too late, and utterly pointless, to sit down and start discussing such matters. You are either ready to recover or you are not, so make sure it is the former and go for it.
 

WightMivvi

War Hero
So true.

There are two types of organisations: those who test their backups, and those who have yet to suffer a catastrophic data loss. ;)

Disaster recovery? In my experience, senior customers often prefer to cripple their outputs (whilst whinging) rather than trigger contingency. Part of this is the belief that a fix is only a few hours away. This can be overcome if there’s a good, honest and realistic relationship between the customer and supplier.

KPIs and SLAs? The cynic in me thinks they’re often a comfort blanket for the gullible. Any supplier worth their salt will make sure that it’s impossible to breach KPIs / SLAs enough to be sanctioned by the customer. I used to deal with the fallout of MOD IT contracts where an entire site could be without service for months without breaching 99% availability SLAs. Another favourite was the “2 hour response time” KPI. A call from the help desk at 1h 59m saying, “Hi, we have your call and we’ll get back to you next year” was a “response”.
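As a rough illustration of how a site can be dark for months inside a 99% SLA, here is a sketch that assumes availability is measured in aggregate across the whole estate. That is only one way such a clause could be written, not a description of how any particular contract measured it, and the site count and outage length are made-up numbers.

```python
def aggregate_availability(site_count: int,
                           period_days: float,
                           outage_days_by_site: dict[str, float]) -> float:
    """Availability as a fraction of total site-days in the period.

    Assumption: the SLA pools every site together, so one site's outage
    is diluted by all the sites that stayed up.
    """
    total_site_days = site_count * period_days
    lost_site_days = sum(outage_days_by_site.values())
    return 1 - lost_site_days / total_site_days


# Hypothetical estate: 200 sites over a 365-day year, one site down for 90 days.
availability = aggregate_availability(200, 365, {"site_a": 90})
print(f"{availability:.2%}")
```

On those numbers the estate still comes out at roughly 99.88%, comfortably above the 99% line, even though one site had no service for three months.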
 

I had a nice contract a couple of years ago on a backup risk that had materialised: the organisation had moved from one that had not suffered a significant loss of high-value, single-instance, unique data to one that had.

I sorted that out for them (and implemented appropriate due diligence test management). I also pointed out that, in the time allowed, this would only solve the backup issue, and that there were other material risks I had observed as time passed, including a building outage or loss of access, that would cause them immense difficulties. I suggested a solution for that, which however went to the back of the queue. I finished up there in late 2019, just before COVID struck, and when I was chatting to one of the guys there a few months later he mentioned that he wished they had implemented what I suggested. Not having the solution wasn't a showstopper and they continued to operate, but it made things much harder.

Agreements (SLAs, KPIs) have their place but they are, of course, not the solution. So back to my point: senior management focus is often on Ts&Cs rather than balanced across all needs.
 
Why does Government keep outsourcing IT to Frank Spencer and Co, whilst Basil Fawlty continues to provide project management services? It would not surprise me if some organisations were still using reel to reel tapes.
"Tapes? What are tapes? Our wire-recorders are tried and trusted technology don'cha know?"
 
Can anyone find any news on where HO has got to with fixing this issue? I can't seem to find much and wonder if the HO hope is that the longer they keep quiet about it, the quicker it will fade from everyone's memory.

If that is the case then it suggests to me they have determined recovery is not possible and so simply gave up trying.
 

theoriginalphantom

MIA
Book Reviewer

Another favourite was the “2 hour response time” KPI. A call from the help desk at 1h 59m saying, “Hi, we have your call and we’ll get back to you next year” was a “response”.

We are to send satisfaction surveys to customers we know are likely to give a good response.
 

WightMivvi

War Hero
We are to send satisfaction surveys to customers we know are likely to give a good response.
I’m not surprised.

I used to work alongside the customer satisfaction team in MOD’s corporate IT department. To avoid such shenanigans they survey users at random and ask for feedback on services provided by MOD and its contractors.

That provides a useful contrast if any contractors tried to skew the results of their own surveys.
 

Tool

LE
I've worked in an environment where the Ops Manager walked into the Ops bridge (as you would now call it), selected about 50% of the operators and team leads, took them into a meeting room and told them to wait there. He then walked out of the Ops area and kicked the main power supply. I must admit that the DR worked very well in that instance and the outage was about 3 minutes (and an Ops Manager updating his CV).*
At the other end of the scale, another company I worked for had a DR plan that included cajoling 6 people (usually the same crowd) into going to the backup site and logging into the running operation from there. If they could, the DR exercise was a success.

"Backups? We don't need no stinkin' backups."

*Edited to add - he didn't lose his job, but it was a "please explain" to the full Board of Directors.
 
