Situation and Problem
You’ve just started your new job as a director, a senior executive responsible for a number of mission critical computer systems… and one of those systems is failing… regularly. Thousands, millions maybe, of users are unable to use the system. What do you do?
Congratulations. You’ve just been hired as an executive in a fast-paced, high-pressure organization.
You show up at your new office in your new role. You’re in charge. You’re the boss.
It feels good. It feels like you have accomplished a major career goal. You call the shots. You’re where the buck stops. There’s no question about this.
First item of business: you meet with your managers for a status update.
It’s a mix of good and bad news.
You have a lot of great people on staff. That’s good.
A lot of your projects are proceeding well. Work items are being completed… mostly on time and on or under budget. That’s also good.
Oh. One more thing.
A mission-critical system is regularly failing. It’s becoming absolutely unusable. It impacts hundreds, thousands, maybe even millions of users.
This has been going on for a while.
The senior executives are livid.
Your managers are frantic.
Users are storming the gate.
Everybody turns to you.
Remember. You are in the position you wanted. You wanted to be in the “position of accountability.”
What are you going to do?
This happened to me. Twice. Once at an operating nuclear power generation plant and then again at a major bank. Ouch.
In the case of the nuclear plant, the system used to plan and schedule work for refueling and maintenance outages was periodically failing. Hundreds of work packages couldn’t be completed. Work was delayed. Many, many dollars were being wasted.
In the case of the bank, the system used by call center advisors to serve the many millions of credit card holders was slowing dramatically to the point of being unusable. The bank was in trouble with regulators due to poor customer service. Call center advisors were jumping to competing companies because their compensation was being hit hard.
What do you do?
There are a few actions you can take. Some are absolutely required. Some should be done in a particular order.
You have to:
Establish who “owns” the problem
Restore the system
Communicate
Determine the cause(s) of the problem
Support your team
Put an end to the problem so it doesn’t happen again
Thank everyone
We’ll cover each of these, one at a time.
When you have a major system down, you have to do two important things. The ways you think about them and go about them are quite different, but it takes both to get the system back up and running.