Ok. For this kind of major system problem, one that recurs, you want to figure out what’s causing the failure and take actions so it stops recurring. In my experience, when a major system fails multiple times, there are underlying factors that cause another failure, then another, then another.

It doesn’t matter what you call this person, but you need to pick someone who will determine the cause of the problem leading to the outages. This kind of work is often called causal analysis. For the purposes of this document, I’ll call this person the causal analysis manager.

If you’re fortunate, you’ll have someone available with the needed skill set to participate in the system recovery. Your causal analysis manager will remain mostly on the fringe, watching, making small suggestions regarding recovery of evidence.

It was a lot of years ago, and I’m going to take a bit of poetic license, but I’m going to talk a bit about a major, major system problem my group dealt with. When I took over the group we had a recurring problem. I think we ended up calling it the “system slowdown problem.” My sole contribution was establishing some ownership and getting the best people involved for the various roles needed. They did the work.

Why am I bringing this up?

You will sometimes find your causal analysis manager and your incident manager in opposition. Your causal analysis manager wants to slow down, gather information, take system logs, etc. They will need this when it’s time to roll up their sleeves and figure out why this problem keeps occurring.

Your systems operations manager is likely to be focusing on only what is required to get the system back on-line. They have probably been pressured in the past to get the system restored.

I had been informed prior to taking my role at the bank, in clear language, that if I didn’t fix the system slowdown problem I would be fired, and that I should be absolutely certain I could get the problem fixed or… just not take the job. Great way to feel welcomed.

At first, the efforts were solely focused on restoring the system.

On average, three days each week, at the worst time of day, the entire system would slowly fade. Response time would lag. And lag some more. And even more until the system was unusable, unusable by hundreds and hundreds of call center advisors in six call centers across the country trying to support tens of millions of credit card holders.

By the way, there was concern from the regulators as to how well the card members were served. This concern originated from the overwhelming card member complaints.

No pressure.

There’s a quote, something like “no plan survives first contact with the enemy.” It’s a derivative of a longer quote from a Prussian, Field Marshal Helmuth Karl Bernhard von Moltke, back in the 1800s.

Your plan, if you choose to incorporate any of the above concepts into it, will suffer. Ours did. It took a few iterations until we got things sorted. Initially, we gave in to the pressure to just get the system back online. You may have the same problem.

Over time we were able to communicate our plan well enough to get the skeptics calm enough for us to take the additional time required to gather the evidence.

The final report identified, if I’m recalling correctly, seventeen different causes of various types that contributed to the system problems.


As soon as you can, you need to appoint one person to manage communications for the incident response.

One of the worst things you can do is to expect your incident manager to do all of their own communicating. I’ve made that mistake.

I’ve also made the mistake of thinking I could handle all the communications myself.

Everyone will want to know what’s going on, will want to know when the system will be “up”.

Pick someone to be your single point of contact. You will be busy being the leader of your organization.

Your communication manager needs to have excellent listening skills, good judgment, and, especially, the ability to translate from “tech” to “business” and back again. Most of the people wanting to know status do not want jargon. They don’t really understand the technology and you don’t want them to.

Lots to know about listening. Active listening. Reflective listening. Empathic listening.

All three are important. In the case of a major system failure, you want your communication manager to be able to move from active to reflective and back as needed.

Your major stakeholders do understand business and business language. They need to be given status updates, when appropriate, in their own business language.

Your communication manager has three primary tasks:

  • Hear and understand your incident manager
  • Translate from tech to business and from business to tech
  • Inform key stakeholders of status, in business terms

You could do this. You should not do this.

Get a dedicated ‘professional’.

You be the leader.


There are really two distinct tasks required to restore the system. Each task leads to the expected deliverable of a restored system; however, each task requires a somewhat different role or roles to execute.

The two tasks:

  • LEAD the effort to restore the system
  • Take specific ACTIONS to restore the system

These two tasks take skill sets that overlap somewhat, but each requires some essential skills that are typically not found in a single individual.

You need to delegate the actual management of the response to someone. I call this person the incident manager. The resolution of the incident is their only responsibility. Make it very clear that no one else is responsible for “running” the response to this incident.

You have to make it very clear to your incident manager that all of your resources are available to them to get the system back online. And, you have to back that up.

You have to run interference. You have to keep the “wrong” people from talking to the incident manager.

The incident manager does not need to be answering random questions.

The rank of the question asker doesn’t matter. Your incident manager must be allowed to focus 100% of their effort on resolving the problem and restoring the system.

“How much longer before the system’s back online?”

“What broke?”

“What caused this to happen?”

“What are you going to do to keep this from ever happening again?”

All of these questions are important, but not at the moment. At the moment, your incident manager is responsible for getting the system back online.

What kind of person makes a great incident manager?

You need someone smart who is mentally agile. You need someone willing to lead, to cause things to happen. They need to understand the needs of the users. They need to understand your organization’s mission, your vision, and your values.

They need to be highly skilled at analytical troubleshooting and have exceptional judgment.

Finally, they need solid technical skills in the areas being affected.

I’ve been fortunate. I’ve known two people who made world-class incident managers. Both worked with me at the nuclear plant and at the bank.

Both managed multiple, significant incidents with bulldog determination to fix the problem and restore service to the business users.

So, while the incident manager is leading the system restoration, there is typically a systems operations manager involved along with their systems operations staff.

Your systems operations manager and their staff are critical to the ongoing operation of the system, to the system restoration, and to preserving the evidence needed to complete the causal analysis.

A really good system operations manager will be great at dealing with the typical problems that come up on a day-to-day basis. They will be great at dealing with most major system failures.

Every now and then, though, a system failure occurs that is an order of magnitude beyond anything considered to be typical. These are the kinds of major system outages being addressed here.

These are the outages that affect a huge percentage of your user base and cost the organization massive amounts of time and money. These kinds of system outages usually recur.

Note: If we were talking about the typically occurring significant system outage that your great systems operations manager can handle, it wouldn’t be YOUR problem. Would it?

So, during the effort to restore the system, led by the incident manager, your system operations manager and staff will be quite busy.

They will be pulling and interpreting system logs, etc. They will be investigating possible actions to take and the likely results. They will be taking a variety of actions and evaluating the results.

So… why do you need a dedicated incident manager as opposed to just having your system operations manager doing it?

You don’t always.

It depends entirely on the capabilities of your system operations manager in the context of the nature of the current major system failure.

There are a few things to consider.

How evolved is your system operations manager? How evolved is your system operations group?

If your system operations manager is new in position, they probably do NOT have high competence in the managing and leading part of the role. They probably are not fully engaged with your vision, values, strategies, etc. They probably require a lot of direction.

If your system operations manager has been in position for a while and is ready, or nearly ready, to move into their next, more advanced, more responsible role, they probably do have high competence. They are probably well aware of your vision, values, and strategies. They probably require little or no direction.

Maybe your system operations manager falls somewhere in the middle.

Unless you are convinced your systems operations manager is absolutely capable of both leading the incident response AND managing the system operations staff in support of the response, you have to dedicate someone to be incident manager.

What kinds of problems might you have when you insert a “foreign” incident manager to muck about in your system operations manager’s world?

Resentment and fear.

You can avoid these by making sure you have created a safe environment for your folks and by making sure your key people, in this case your system operations manager, are secure in their positions. But, in the face of a major problem such as this, you do not have the time to set things right, if they are not so at the time.

What you can do is this.

First, talk to your system operations manager. Make it crystal clear to them that they absolutely own the system and its operation and that their position is in no way at risk. Let them know that they are responsible for providing full support to the incident manager.

Next, meet with them both. Lay out your expectations for the work to restore the system. This is your job as the leader. Make the differences between the two roles clear. Make it clear they both have your full support.

I know this may all sound complicated. Later in this document I will summarize in simple terms.

Know that when this is occurring you will do fine. Follow the simple steps and use plain language. Be clear. Be succinct. Be the leader.


So you’re in charge. And the system is down. What do you do? First, you have to make it absolutely clear who is accountable. The question “Where does the buck stop?” has to be answered, and it has to be you.

This is an ownership thing. It’s important to know who owns the problem. It’s important for everyone to know who owns the problem.

In my case, at the bank, our major system would fail at least three times a week, usually at peak traffic times. The system would be degraded to the point it was completely unusable for several hours.

This happened, literally, in my first week in the job. When I asked the question “Who owns this problem?” the answer I received was Development, meaning the Application Development Division.

In this case, the actual users of the system were the call center advisors. The Application Development Division worked directly with the customers, meaning the call center advisors and their executives.

My people were part of the System Operations Division, and they did not work directly with the call center advisors or their executives.

So what would happen is this. The system would go down. The call centers would become extremely upset. Their executives would call the Application Development Division, who would come yell at my people.

In doing this, the Application Development Division essentially made themselves the owner of the problem. To me, this was clearly a system operations problem.

Nothing was changed. There were no new code modules implemented, no changes to the database. Nothing of the sort. But the system would go down.

The first thing I had to do was make sure that it was very clear that my group, the Systems Operations Division, owned responding to this problem.

I remember very clearly explaining to both my own people and the people in the Application Development Division that Operations owned this problem. We owned the problem.

So first you have to make sure it’s really clear who owns the problem, and it’s you.


Situation and Problem

You’ve just started your new job as a director, a senior executive responsible for a number of mission-critical computer systems… and one of those systems is failing… regularly. Thousands, maybe millions, of users are unable to use the system. What do you do?
