Ok. For this kind of major system problem, one that recurs, you want to figure out what’s causing the failures and take action so they stop recurring. In my experience, when a major system fails multiple times, there are underlying factors that, left unaddressed, will cause another failure, then another, then another.
It doesn’t matter what you call this person, but you need to pick someone who will determine the cause of the problem leading to the outages. This kind of work is often called causal analysis. For the purposes of this document, I’ll call this person the causal analysis manager.
If you’re fortunate, you’ll have someone available with the needed skill set to participate in the system recovery. Your causal analysis manager will remain mostly on the fringe, watching and making small suggestions about gathering and preserving evidence.
It was a lot of years ago, and I’m going to take a bit of poetic license, but I want to talk about a major, major system problem my group dealt with. When I took over the group, we had a recurring problem. I think we ended up calling it the “system slowdown problem.” My sole contribution was establishing ownership and getting the best people involved in the various roles needed. They did the work.
Why am I bringing this up?
You will sometimes find that your causal analysis manager and your incident manager are in opposition. Your causal analysis manager wants to slow down, gather information, capture system logs, and so on. They will need that evidence when it’s time to roll up their sleeves and figure out why the problem keeps occurring.
Your systems operations manager is likely to focus only on what is required to get the system back online. They have probably been pressured in the past to get the system restored quickly.
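To make that tension concrete, here is a minimal sketch of the kind of evidence capture a causal analysis manager might ask for before the operations team bounces anything. It is an illustration only, written in Python and assuming a Linux-style host; the paths, commands, and the script itself are my own assumptions for this document, not something we used on the slowdown problem.

```python
# A minimal evidence-capture sketch, assuming a Linux-style host where logs
# live under /var/log. Paths, commands, and retention are illustrative only.
# Running it may require elevated privileges.
import shutil
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def snapshot_evidence(log_dir: str = "/var/log",
                      dest_root: str = "/var/incident-evidence") -> Path:
    """Copy recent logs and basic system state into a timestamped folder
    so recovery (restarts, failovers) can proceed without destroying evidence."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(dest_root) / stamp
    dest.mkdir(parents=True, exist_ok=True)

    # Preserve the raw logs before a service restart rotates or truncates them.
    shutil.copytree(log_dir, dest / "logs", dirs_exist_ok=True)

    # Capture volatile state that disappears the moment the system is bounced.
    for name, cmd in [("processes.txt", ["ps", "aux"]),
                      ("sockets.txt", ["ss", "-tanp"]),
                      ("memory.txt", ["free", "-m"])]:
        with open(dest / name, "w") as out:
            subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT, check=False)

    return dest

if __name__ == "__main__":
    print(f"Evidence preserved in {snapshot_evidence()}")
```

The point is not the particular script. The point is that a few minutes of capture up front is usually cheap compared to losing the evidence of why the system failed, and having something like this agreed on in advance keeps the two managers from arguing about it in the middle of an outage.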
Before taking my role at the bank, I had been informed, in clear language, that if I didn’t fix the system slowdown problem I would be fired, and that I should be absolutely certain I could get the problem fixed or… just not take the job. Great way to feel welcomed.
At first, the efforts were solely focused on restoring the system.
On average, three days each week, at the worst time of day, the entire system would slowly fade. Response time would lag. And lag some more. And even more, until the system was unusable. Unusable by hundreds and hundreds of call center advisors in six call centers across the country trying to support tens of millions of credit card holders.
By the way, the regulators were concerned about how well the card members were being served. That concern stemmed from the overwhelming volume of card member complaints.
No pressure.
There’s a quote, something like “no plan survives first contact with the enemy.” It’s a derivative of a longer quote from a Prussian, Field Marshal Helmuth Karl Bernhard von Moltke, back in the 1800s.
Your plan, if you choose to incorporate any of the above concepts, will suffer. Ours did. It took a few iterations until we got things sorted. Initially, we gave in to the pressure to just get the system back online. You may have the same problem.
Over time, we were able to communicate our plan well enough to calm the skeptics and buy the additional time required to gather the evidence.
The final report identified, if I’m recalling correctly, seventeen different causes of various types that contributed to the system problems.