There are really two distinct tasks required to restore the system. Each task leads to the expected deliverable of a restored system; however, each requires a somewhat different role or roles to execute.

The two tasks:

  • LEAD the effort to restore the system
  • Take specific ACTIONS to restore the system

These two tasks take two skill sets that overlap somewhat, but each includes essential skills that are typically not found in a single individual.

You need to delegate the actual management of the response to someone. I call this person the incident manager. The resolution of the incident is their only responsibility. Make it very clear that no one else is responsible for “running” the response to this incident.

You have to make it very clear to your incident manager that all of your resources are available to them to get the system back online. And, you have to back that up.

You have to run interference. You have to keep the “wrong” people from talking to the incident manager.

The incident manager does not need to be answering random questions.

The rank of the question asker doesn’t matter. Your incident manager must be allowed to focus 100% of their effort on resolving the problem and restoring the system.

“How much longer before the system’s back online?”

“What broke?”

“What caused this to happen?”

“What are you going to do to keep this from ever happening again?”

All of these questions are important, but not at the moment. At the moment, your incident manager is responsible for getting the system back online.

What kind of person makes a great incident manager?

You need someone smart who is mentally agile. You need someone willing to lead, to cause things to happen. They need to understand the needs of the users. They need to understand your organization’s mission, your vision, and your values.

They need to be highly skilled at analytical troubleshooting and have exceptional judgment.

Finally, they need solid technical skills in the areas being affected.

I’ve been fortunate. I’ve known two people who made world-class incident managers. Both individuals worked with me at the nuclear plant and at the bank.

Both managed multiple, significant incidents with bulldog determination to fix the problem and restore service to the business users.

So, while the incident manager is leading the system restoration, there is typically a systems operations manager involved along with their systems operations staff.

Your systems operations manager and their staff are critical both to the ongoing operation of the system and to supporting the system restoration, including preserving the available evidence needed to complete causal analysis.

A really good system operations manager will be great at dealing with the typical problems that come up on a day-to-day basis. They will be great at dealing with most major system failures.

Every now and then, though, a system failure occurs that is an order of magnitude beyond anything considered to be typical. These are the kinds of major system outages being addressed here.

These are the outages that affect a huge percentage of your user base and cost the organization massive amounts of time and money. These kinds of system outages usually recur.

Note: If we were talking about the typically occurring significant system outage that your great systems operations manager can handle, it wouldn’t be YOUR problem. Would it?

So, during the effort to restore the system, led by the incident manager, your system operations manager and staff will be quite busy.

They will be pulling and interpreting system logs, etc. They will be investigating possible actions to take and the likely results. They will be taking a variety of actions and evaluating the results.

So… why do you need a dedicated incident manager as opposed to just having your system operations manager doing it?

You don’t always.

It depends entirely on the capabilities of your system operations manager in the context of the nature of the current major system failure.

There are a few things to consider.

How evolved is your system operations manager? How evolved is your system operations group?

If your system operations manager is new in position, they probably do NOT have high competence in the managing and leading part of the role. They probably are not fully engaged with your vision, values, strategies, etc. They probably require a lot of direction.

If your system operations manager has been in position a while and is ready or nearly ready to move into their next, more advanced, more responsible role, they probably do have high competence. They are probably well aware of your vision, values, and strategies. They probably require little or no direction.

Maybe your system operations manager falls somewhere in the middle.

Unless you are convinced your system operations manager is absolutely capable of both leading the incident response AND managing the system operations staff in support of the response, you have to dedicate someone to be incident manager.

What kinds of problems might you have when you insert a “foreign” incident manager to muck about in your system operations manager’s world?

Resentment and fear.

You can avoid these by making sure you have created a safe environment for your folks and by making sure your key people, in this case your system operations manager, are secure in their positions. But, in the face of a major problem such as this, you do not have the time to set things right, if they are not so at the time.

What you can do is this.

First, talk to your system operations manager. Make it crystal clear to them that they absolutely own the system and its operation and that their position is in no way at risk. Let them know that they are responsible for providing full support to the incident manager.

Next, meet with them both. Lay out your expectations for the work to restore the system. This is your job as the leader. Make the differences between the two roles clear. Make it clear they both have your full support.

I know this may all sound complicated. Later in this document I will summarize in simple terms.

Know that when this is occurring you will do fine. Follow the simple steps and use plain language. Be clear. Be succinct. Be the leader.