The accident at Chernobyl on April 26th, 1986 is the very button of fashion at the moment, what with the eponymous HBO miniseries. Now, I’d hate for you to think of me as a bandwagon jumper in choosing now to write something about Chernobyl, but rest assured, I was interested in Chernobyl long before the series. Admittedly, this makes me something much worse than a bandwagon jumper (a nuclear accident hipster, the worst kind of hipster), but it does mean that I recently read Adam Higginbotham’s Midnight in Chernobyl. This is an excellent blow-by-blow account of the disaster that reads more like a thriller, and reveals much about the decision making at each stage of the crisis. Moreover, just this June I finally visited the Chernobyl exclusion zone and the abandoned town of Pripyat.
As someone involved in the design of two incident management products (Opsgenie and Statuspage), I found all of this got me thinking about incident management. Thankfully, when we manage incidents involving software products the stakes are rarely as high as for nuclear reactors, but the many mistakes (and triumphs) of the Soviet response to the disaster illustrate how we should and should not respond to crises that befall us. With that in mind, below are a few lessons I took from Higginbotham’s telling of the tale. I won’t dwell too much on the course of events (the book, this article or the HBO series are all excellent ways to familiarise yourself with exactly what happened) but will instead focus on a few insights.
An open culture ensures everyone is working with the right information
As those who watched the HBO series should now well know, the ultimate cause of the accident was a critical flaw in the Soviet RBMK-1000 model of nuclear reactor. Boron-carbide control rods that were meant to be inserted to slow the nuclear chain reaction were tipped (for complex cost-saving reasons) with a material, graphite, that would in fact dramatically increase the reaction, potentially leading to a runaway chain reaction and an explosion.
Plant operators at Unit 4 of the V.I. Lenin power plant in Chernobyl hit an “AZ-5” SCRAM button that they thought would insert all of the control rods and quickly quash the ongoing nuclear reaction. Instead — and much to their surprise — this caused the runaway reaction noted above, followed by a huge steam explosion.
What the TV show misses is that this flaw was well known to the relevant Soviet authorities. NIKIET (the Soviet nuclear R&D institute) compiled a report in 1980 that identified nine major design failings, and made it clear that accidents were in fact probable during normal operation of the plant. The nature of the Soviet system, and the unwillingness of powerful vested interests to accept fault, meant no action was taken, either to redesign the RBMK-1000 or to train plant staff.
Instead, operating procedures were updated to account for these flaws, a solution that assumed, as Higginbotham puts it:
“… so long as they followed the new operating instructions closely, human beings would act as promptly and as unfailingly as any of the plant’s electromechanical safety systems”
I don’t need to spend much time critiquing the panglossian naiveté of this attitude. What is worth discussing in more detail, however, is that the plant personnel were not told about these failings. This meant that the reactor did not behave as they expected, and that they took a course of action that led to disaster. Hiding salient information between fiefdoms within an organisation can cause problems at the best of times, but during a crisis it will increase the time taken to resolve a problem — and indeed, lead to more incidents arising due to improper operation.
The lesson for us
An open culture ensures teams do not hit unexpected problems that other teams could have made them aware of — and ensures that they have the best information available to resolve an incident. As I’ll discuss later, making blameless postmortems available to everyone within an organisation is one of the best ways to achieve this.
Be prepared for every outcome
The Soviet authorities were infuriatingly blithe about the prospect of severe nuclear accidents in the USSR. Even Valery Legasov, the hero of the TV show, opined in an article published prior to the accident that a Three Mile Island could never happen in the USSR because its operators were better trained and safety standards higher than in the US. And indeed, the USSR had no formal plans — of any kind — for such a large nuclear disaster as Chernobyl. All contingency plans foresaw a single, short release of radioactivity.
Instead, their contingencies considered what they called a “maximum design-basis” accident, even though the engineers were well aware that far more severe “beyond design-basis” accidents were theoretically possible — including the actual accident that befell the Chernobyl plant. Despite this, no action was taken to prepare for this possibility.
The lesson for us
Plan for every eventuality, and prepare runbooks and procedures for how to deal with every possible outage and incident. Believing a worst case to be unlikely won’t help to prepare you for when the improbable does happen.
The importance of wargaming
One way to prepare for the worst possible outcomes is to war game — to run simulations and drills of an incident, and see how well your software — and your team — responds. Often in software engineering this is called chaos engineering, and is an important way to stress-test your systems. In response to Three Mile Island, scientists in the West had been running computer simulations of worst-case scenarios for many years by the time of the Chernobyl accident.
Soviet physicists, on the other hand, felt their systems were so safe that they didn’t need to run such simulations. All of which meant that they underestimated the probability of such an accident, didn’t understand the likely outcomes of such an accident and did not help to define a state response to these accidents.
The lesson for us
Wargaming incidents helps you to understand how resilient your systems and procedures are, in the face of both probable and improbable incidents. It also provides vital incident management practice for your team. Casual optimism will not help you when your systems go down.
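To make the idea concrete, here is a minimal sketch of a wargame in code. It assumes a hypothetical service that falls back to a cached answer when a downstream dependency fails. The names (`FlakyDependency`, `fetch_with_fallback`, `run_drill`) are invented for illustration and are not taken from any particular chaos engineering tool. The point is only that the failure is injected on purpose, and that you then check whether the system copes.

```python
import random


class FlakyDependency:
    """Hypothetical stand-in for a downstream service that fails at a set rate."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def call(self):
        # Inject a failure on purpose, the way a drill would.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return "primary"


def fetch_with_fallback(dependency):
    """The code under test: serve a fallback answer when the dependency fails."""
    try:
        return dependency.call()
    except ConnectionError:
        return "fallback"


def run_drill(failure_rate, requests=1000, seed=0):
    """Run the drill and count how each request was served."""
    rng = random.Random(seed)  # seeded so the drill is repeatable
    dependency = FlakyDependency(failure_rate, rng)
    results = {"primary": 0, "fallback": 0}
    for _ in range(requests):
        results[fetch_with_fallback(dependency)] += 1
    return results
```

Calling `run_drill(failure_rate=1.0)` simulates the worst case, a dependency that is down entirely, in which every request should be served from the fallback. A real drill would target real infrastructure and involve the on-call team, but even a toy like this forces you to ask what happens when the improbable occurs.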
Accurate monitoring leads to appropriate actions
Having interviewed a huge number of teams about their incident management response, I still find it all too common that the first a company hears about an outage is from its customers. Despite the proliferation of monitoring tools over the past 10 years, many teams have very little monitoring set up for the software services they operate. This has a number of deleterious effects: response and resolution times slow down, identifying the precise cause of a problem becomes harder and the system can become something of a black box.
Such was the case at Chernobyl. After the initial blast, a nuclear fire continued to burn, reaching as high as 2700°C. And yet, scientists didn’t really know what was going on within the ruins of Unit Four. They didn’t know what was happening to the graphite, whether there was burning zirconium, or what impact their attempts to stop the fire with tonnes of lead and boron were having. Even their measurements of the amount of radionuclides being released had a 50 percent margin of error.
The Soviet authorities may have been able to put in place some monitoring systems prior to the accident, yet the nature of the blast and fire makes it doubtful that many of these would have functioned after the accident. Still, if they had been in possession of accurate monitoring inside the plant then Soviet scientists would have been able to identify the best course of action much sooner. Instead, as we’ll see, authorities needed to direct multiple resolution efforts at the same time, the uncertainty about the best course of action spreading resources more thinly.
The lesson for us
Accurate monitoring of your systems not only alerts you to problems sooner but also helps you to act more swiftly and with greater confidence to resolve incidents.
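As a rough illustration of what even basic monitoring looks like, the sketch below raises an alert when a service’s error rate stays above a threshold for several checks in a row. Everything here is an assumption made for the example: the `Alert` type, the `check_error_rate` function, the 5 percent threshold and the three-check rule are hypothetical, and a real setup would lean on a dedicated monitoring and alerting tool rather than hand-rolled code.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    message: str


def check_error_rate(service, samples, threshold=0.05, consecutive=3):
    """Return an Alert if the last `consecutive` samples all exceed `threshold`.

    `samples` is a list of error rates (0.0 to 1.0), oldest first.
    Requiring several breaches in a row avoids paging on a single blip.
    """
    if len(samples) < consecutive:
        return None  # not enough data to judge yet
    recent = samples[-consecutive:]
    if all(rate > threshold for rate in recent):
        return Alert(
            service,
            f"error rate above {threshold:.0%} for {consecutive} checks",
        )
    return None
```

For example, `check_error_rate("api", [0.01, 0.2, 0.3, 0.4])` returns an alert, while `check_error_rate("api", [0.2, 0.01, 0.3])` returns `None` because the breach is not sustained. The design choice worth noting is the trade-off between speed and noise: a lower `consecutive` value tells you sooner, a higher one wakes you up less often for nothing.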
Admit what you are facing, and communicate it promptly to the entire team
One of the most famous scenes in the HBO series is the refusal of Anatoly Dyatlov (the deputy chief engineer in charge during the ill-fated shift at the plant) to accept that Legasov could have seen irradiated graphite on the ground around the plant, so deep was his denial that the reactor could have blown up. In reality, this denial was widespread: many of the people tasked with responding to the incident took far too long to accept that the reactor was totally destroyed. As this dialogue quoted in Midnight in Chernobyl reveals, it was Boris Scherbina (head of the government commission to Chernobyl) who was initially in denial:
“…I’m going to demand that the minister of energy restart all units..”
Scherbina fell silent for a few moments as Gorbachev spoke. At last, Scherbina said “Okay”, and replaced the receiver in its cradle. He turned to [Ukrainian energy minister] Sklyarov.
“Did you hear all that?”
He had. He was appalled. “You can’t restore the reactor because there is no reactor” he said “it no longer exists”
“You’re a panicker”
This denial meant that important information was not being communicated to stakeholders, in this case the Soviet government, Soviet citizens and to the rest of the world. This lengthened the time to resolve the incident, since the necessary resources could not be brought to bear. Indeed, Scherbina’s denials continued:
Scherbina repeated to Scherbitsky [the Ukrainian party chief] what he’d just told Gorbachev: a can-do shock-work action plan of fantasy and denial. Then he handed the phone to Sklyarov.
“He wants to talk to you. Just say what I was saying”
“I don’t agree with what comrade Boris Evdokimovich is saying” Sklyarov said “we need to evacuate everyone”
Scherbina snatched the phone back from the energy minister’s hand “he’s a panicker!” he yelled at Scherbitsky. “How are you going to evacuate all these people? We’ll be humiliated in front of the whole world!”
Scherbina was more concerned with Soviet prestige, with admitting that there had been an incident, than with taking the necessary steps to contain it. This denial led to a delay in the evacuation of Pripyat, which may well have cost the lives of many people (though we’ll never know the full death toll that can be attributed to the accident). It also delayed the actions needed to contain the fire and limit the release of radionuclides.
The lesson for us
When a major incident occurs, limiting what you tell your organisation, perhaps hoping to hide the scope of the outage, will only result in a longer time to resolve the incident. You won’t be able to quickly acquire what (and who) you need to fix the problem. Furthermore, other teams will not understand why there is an outage or know when it can be resolved, and your organisation won’t prioritise resolving the incident as much as it should. Accurate, prompt internal incident communications that keep your organisation up-to-date with how the incident response is progressing are an essential part of an effective incident response.
Honest external communications avoid rumour, anger and confusion
Outside of the plant staff and the team tasked with fixing the aftermath of the accident, there were three main groups of external stakeholders in the Chernobyl accident. First, the Soviet citizens in the nearby city of Pripyat and the surrounding villages. To this group, the same denials were at first repeated. In a meeting of Pripyat city leaders at 9am on April 26th (about 8 hours after the accident) the deputy party boss for the Kiev region admitted that there had been an accident and stated “the conditions are being evaluated right now. When we have more details, we’ll let you know”, which does sound eerily similar to the initial incident communications we send to our customers via Statuspage.
The difference, of course, is that the authorities did know a good deal more about the danger Pripyat was in than they let on, but it took Scherbina until 7am the following day to order an evacuation, and until 1.10pm for the evacuation order to be broadcast to the public via radio-tochki (small radios in every apartment that received important Party broadcasts). It contained the following:
“Attention! Attention! Dear comrades! The City Council of People’s Deputies would like to inform you that, due to an accident at the Chernobyl nuclear power plant in the city of Pripyat, adverse radiation conditions are developing. Necessary measures are being taken by the Party and Soviet organisations and the armed forces. However, in order to ensure complete safety for the people, and, most importantly, the children, conducting a temporary evacuation of city residents to nearby localities in the Kiev region has become necessary… we ask that you remain calm, be organised, and maintain order during this temporary evacuation”
Partly out of a desire to avoid panic, and as much out of a desire to conceal what they felt was the embarrassing truth, the broadcast shared little about the danger the population was in, and misled the people of Pripyat about the permanence of the evacuation — most expected to return home soon. What the Politburo hoped was to conceal news of a possible reactor meltdown from the world beyond a thirty-kilometre zone around the plant.
As it turns out, a large cloud of irradiated gas and dust is a difficult thing to hide, and soon a Swedish nuclear plant had detected the release of radioactive material, and the world suspected an accident at a Soviet nuclear plant. Unable to hide the truth any more, Radio Moscow broadcast the following statement on Monday 28th April:
“An accident has taken place at Chernobyl nuclear power plant. One of the atomic reactors has been damaged. Measures are being taken to eliminate the consequences of the accident. Aid is being given to those affected. A government commission has been set up”
This minor footnote was all the Soviet people heard about the accident, and all the outside world initially knew. Yet nature abhors a vacuum, and into this void spilled lurid tales in the Western press. The Daily Mail claimed “‘2,000 DEAD’ IN ATOM HORROR”, showing that it’s been full of shit for as long as it has existed. A Dutch radio station claimed that two reactors, not just one, had in fact melted down, whilst the New York Post reported that 15,000 had been killed and buried in a mass grave.
In seeking to hide the accident, all the Soviet authorities achieved was to create an empty space, devoid of any information, in which rumour, gossip and innuendo replaced the facts. This created more panic, at least in the wider world, than a sober relaying of the facts and the efforts to contain the accident would have. It undermined people’s trust in the Soviet state, and ended up being far more humiliating than an honest response would have been.
The lesson for us
This very directly applies to a team’s incident response when a software service experiences an outage. You can’t hide the outage, and failing to communicate promptly and honestly will only undermine your customers’ trust in you and your services. Update your customers as soon as possible, and keep them updated as you work to resolve the incident.
Give your incident commanders the authority they need to resolve the incident
There was much bungling in the response to the accident, and a good deal of incompetence on show. Still, at least some of those tasked with directing the response exercised their authority in impressive ways. Ivan Silayev (Deputy Chairman of the government commission to Chernobyl) coordinated a response on multiple fronts. He requested the plant staff work on ways to pump nitrogen into the plant to smother the fire. He moved a subway construction team from Kiev to begin drilling beneath the reactor to freeze the soil with liquid nitrogen, and so protect the water table from the radioactive fuel. He also requested volunteers to go beneath the reactor to open the valves of the steam pool and begin pumping out the radioactive water.
This is in addition to the evacuation of Pripyat, the creation of an exclusion zone and the ongoing efforts to smother the blaze via helicopter. Part of this success is down to his talents in managing teams during a crisis, but another part is down to the authority invested in him (and in Scherbina) by Gorbachev. As Higginbotham describes it:
“With a single telephone call, any necessary resources could be rushed to the station from almost anywhere in the Union: tunneling experts and rolled lead sheets from Kazakhstan; spot welding machines from Leningrad; graphite blocks from Chelyabinsk; fishing net from Murmansk; 325 submersible pumps and 30,000 sets of cotton overalls from Moldova”
What the Soviets did do well at Chernobyl was to sufficiently empower the government commission to successfully end the crisis.
The lesson for us
Incident commanders need to have considerable authority invested in them by an executive team if they are to successfully resolve an incident. Whilst the incident is ongoing, they need to be able to marshal whatever resources or people they need; if the executive team disagrees with a request, fine, but then someone else — probably the executive that disagreed — needs to take over the incident response. If you trust someone to resolve an incident, you need to trust them to know what is needed to bring the incident under control.
Fighting the problem isn’t the same as finding a solution
It’s tempting to think that the actions you take to immediately restore service after a disruption constitute a fix to the problem. Yet getting a service back up and running isn’t the same as fixing a problem. There are all sorts of quick, hacky ways that you can improvise to restore service, when a real fix requires a considerable amount of follow-up after the incident.
And so it was at Chernobyl. To take just one of the many simultaneous initiatives to fight the immediate crisis, Nikolay Antoshkin was in charge of the air force’s effort to smother the nuclear fire with helicopter airdrops of lead, sand and boron. This is no easy feat, but Antoshkin rapidly solved many stark challenges facing him, staying up all night to do so. He brought in massive Mi-26 (“Flying Cow”) helicopters that could carry 20 tonnes per load. He then had the population of several towns help to fill bags with sand, improvised an air traffic control system and used decommissioned braking parachutes as cargo nets.
All of this constitutes some exceptional incident management work (though how much of a factor the airdrops were in smothering the fire is now debatable), but clearly bears little relationship to actually solving the problem. Just because the nuclear fire in the ruins of Unit Four was extinguished did not mean the incident was over. A real resolution would involve decontaminating the area, modifying all remaining RBMK-1000 reactors and permanently sealing the ruins of reactor Unit Four to prevent any more radioactive release.
The man tasked with the last challenge was Lev Bocharov, an engineer who directed the efforts to finish the concrete and steel coffin, or “Sarcophagus”, around the ruins of Unit Four, and allow the other reactors to resume operation. Built in spite of the extreme conditions that the radiation presented, the finished Sarcophagus contained 440,000 cubic metres of concrete, 600,000 cubic metres of gravel and 7,700 tonnes of metal. And yet, this was still part of the incident response, a follow-up task that provided the closest thing to a “fix” to the problem that the USSR could muster.
The lesson for us
Incident management does not end when the immediate incident is resolved, but continues until more permanent solutions and learnings can be found.
Blameless postmortems help you truly resolve a problem
Soviet physicists, working from logs kept by the reactor staff, identified the cause of the accident — the faulty AZ-5 SCRAM system — very quickly. Sadly, this truth was intentionally concealed by the chiefs of the Soviet nuclear ministry, who over the course of many meetings tried to divert blame away from themselves and their designs. The initial government report, overseen by the Soviet nuclear ministry (the euphemistically-named “Ministry of Medium Machine Building”) laid blame entirely on the operators (such as Dyatlov and plant director Viktor Brukhanov), who would be prosecuted in what was probably the last show trial in the Soviet Union.
Yet the Ministry of Energy created an appendix to the report, based on their own investigations, that blamed the many design flaws in the RBMK-1000 reactor, including the faulty control rods. In further meetings, Efim Slavsky (the Soviet nuclear minister) resorted to shouting down opinions he didn’t want to hear.
Nevertheless, whilst the Politburo did blame the plant operators and the lax enforcement of safety standards, it did note the design flaws in the RBMK-1000, and called out Slavsky for knowing about these failings but doing nothing to fix them. Gorbachev even praised the idea of openness, as per Glasnost:
“Openness is of huge benefit to us…If we don’t reveal everything the way we should, we’ll suffer.”
And yet, the very next day the KGB shared a list of items marked “Secret” from the Politburo’s report, and first amongst these was “Information revealing the real reasons for the accident in Unit Four”. There was no appetite within the Soviet system for openness regarding the failings of the RBMK-1000. Still, by the end of 1987, all remaining RBMK-1000 reactors had been quietly improved with, amongst other things, additional control rods and a more effective emergency shutdown system.
Eventually, then, the real causes were identified and the failings fixed (and, following the collapse of the Soviet Union, these design flaws were made public). Yet all of the finger-pointing and blame-seeking only delayed the changes that needed to be made. If the problems were known, why were these improvements not made within weeks of the accident happening? It’s entirely possible that another reactor could have melted down in the delay to implement these improvements.
The lesson for us
Seeking blame during the postmortem of an incident only creates defensiveness that obscures the root causes of the problem. If a postmortem is blameless, then it is in everyone’s interest to work together to identify what went wrong and collaborate on a fix. Moreover, blame during postmortems does not incentivise people to identify and fix problems before an incident occurs, since they can expect to receive blame for the problem existing in the first place! Blame poisons the culture, and was one of the poisons present in the bloodstream of the Soviet system as a whole.
To close, it’s worth noting how much Chernobyl contributed to the collapse of the Soviet Union, in loss of trust, loss of prestige and in the cost of the response. Companies can, do and will face software incidents that are an existential risk to their operations, so getting your response right is of critical importance.
A version of this piece was initially posted to https://medium.com/@neovenator/chernobyl-or-how-to-manage-incidents-a06f89680c3a