During the early afternoon of August 14, 2003, I hosted a tabletop exercise at my company’s New York City headquarters. The scenario we played through was a power outage. The tabletop was fun, it went well, and the participants shared a lot of valuable information.
Does August 14, 2003 sound familiar to you? For me, it is a day and event I will never forget. Late in the afternoon I was traveling home to Long Island from Manhattan. Uh oh, maybe you guessed it: the lights went out across the northeastern United States, impacting approximately 50 million people in the Great Northeast Blackout of 2003. I like to think of the timing of the power outage tabletop and the same-day real event as fortuitous.
Unfortunately, the folks in our NYC office referred to me as ‘Marty the Jinx’ for years afterward. But they were so well prepared! So yeah, I will gladly ‘take one for the team’ and happily live with the nickname.
Creatively Recovering Our Time-Sensitive Processes
The Great Blackout of 2003 is one of my most memorable preparedness, response and recovery stories. Our very time-sensitive (critical) trading (<1 hr RTO) and customer service (<1 hr RTO) processes were located in Manhattan, NY. The many risks of being in Manhattan after the horrendous 9/11 World Trade Center attack heightened the need for robust emergency response and business recovery solutions. We spent a lot of time and effort developing and testing our strategies, with the safety of our employees foremost in mind.
Our primary business recovery strategy, if something should happen in Manhattan, was to recover at one of our sister sites across the Hudson River in New Jersey. If we could not recover in-house for any reason, the secondary strategy was to recover at our third-party vendor’s work-area recovery location. It is important to always have multiple recovery options ready and thoroughly tested.
I was in charge of designing and building out the recovery capability, but I give credit to my boss. During one of our first continuity discussions he described how, at a former employer, he had leveraged a lunchroom as recovery space, with PCs and phones on carts ready to roll into place at a moment’s notice – and he did this in the 1990s! Always a guy ahead of his time and a great person to work for.
Tip – It is important to have a mentor to learn from. I hope I am now considered a mentor to some people.
I took inspiration from his insight and designed a similar strategy for our critical customer-facing processes, enabling us to recover in a conference room in New Jersey. Fortunately, we built our recovery strategies and diligently tested them often. We ran through tabletops AND physically tested recovery by rolling out the carts and working from recovery desktops in the recovery conference room! Much more on interchangeable work area recovery exercises (IWARE) in another post on UltimateBusinessContinuity.com.
As you can imagine, all hell broke loose on my way home that evening in 2003. 50,000,000 people were literally in the dark, including me. At first, many of us thought this might be a large-scale cyber-attack on the electrical grid by a terrorist group. It was an understandable thought, as we were in NY and the recent events of 9/11 were ingrained in our thoughts every day, as they always will be. We learned this event was not terrorist related, although I firmly believe we will face an attack on our infrastructure at some point from a terrorist group or nation-state. The Northeast blackout was caused by human error, and a series of cascading infrastructure events escalated the impact to historic proportions. Cascades accompany many types of natural and manmade events.
Tip – Always consider cascading events when planning. Unfortunately, in some high-profile recovery mishaps organizations never considered cascading impacts.
Tip – Human error triggering disruptive events is not unusual. Back in my IT days I learned to be very careful when making changes to programs or the network, especially late in the day when I was tired or in a rush. Software developers are smart not to commit changes to production late in the afternoon on a Friday. That had a way of wrecking a weekend.
The evening of the 2003 power outage I got home at 5 pm. I had scheduled the tabletop early in the day so I could get home early for some much-needed R&R. Well, as you are well aware, in our profession we are on call 24x7x365, and duty called that night and for the following week. But that is what we BC professionals live for! I always say, ‘If I do not step up and get our company through a crisis – FIRE ME!’ It is where the ‘rubber meets the road.’ Tip – Use my FIRE ME line in interviews (with passion) and you will have a good chance of getting the job – but you must be prepared to live up to it.
Ready – Set – Go!
On the first night of the Great Blackout I opened our crisis team conference line. Every member of our experienced crisis response team stepped up. We kept that line open for 72 straight hours. It was probably the only time since my college days, or my big-time coding years in the 1990s, that I worked 72 hours straight, with only three quick naps. Tip – Keep lots of coffee brewing!
Throughout the event the importance of communication was driven home. We had active participation from the Incident Command Team (ICT), Emergency Operations Team (EOT) and critical IT and telecommunications partners throughout our enterprise. So many fruitful discussions focused on the various scenarios we expected to play out. Situational awareness information was so important, and it flowed to the people who needed it as the situation escalated.
Everyone stepped up. If you have ever been in the middle of a crisis situation you know it can be a bit surreal and you might agree that time seems to either fly by or go in slow motion. Information had to be digested and big decisions had to be made in an intelligent manner.
By 2 am the Mayor of New York City was ‘hopeful’ of getting train service going by the morning rush hour. By 3:30 am it was far from a sure bet. Our team weighed our options. Wisely, our Incident Commander made the decision to have our facilities team in New Jersey (NJ) begin readying the business recovery rooms in the event mass transit was not running and we had to recover there.
Facilities and technical employees in NJ reported to the recovery site and rolled out the PCs that were on the carts. They fired them up to ensure they connected to the network and checked that the latest image and patches were in place (we updated the images often, so there were no issues). They wired up and tested the phones (wireless could not be used for sensitive trading calls). We were ready to work!
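If you want to automate that ‘fire them up and verify’ step today, here is a minimal sketch of a readiness check. The hostnames and services are hypothetical, and this is not the tooling we used back then; it simply illustrates confirming that each recovery PC is on the network and that the critical services it needs are reachable (it assumes a Unix-style ping command).

```python
# Minimal sketch of a recovery-workstation readiness check.
# Hostnames and services below are hypothetical examples.
import socket
import subprocess
import sys

RECOVERY_PCS = ["rec-pc-01", "rec-pc-02", "rec-pc-03"]                        # hypothetical
CORE_SERVICES = [("trading-gw.internal", 443), ("voice-gw.internal", 5060)]   # hypothetical

def pc_responds(host: str) -> bool:
    """Ping the workstation once to confirm it powered up on the network."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def service_reachable(host: str, port: int) -> bool:
    """Open a TCP connection to confirm a critical service is reachable."""
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

def main() -> int:
    failures = []
    for pc in RECOVERY_PCS:
        if not pc_responds(pc):
            failures.append(f"{pc}: no response")
    for host, port in CORE_SERVICES:
        if not service_reachable(host, port):
            failures.append(f"{host}:{port}: unreachable")
    if failures:
        print("NOT READY:")
        for item in failures:
            print("  " + item)
        return 1
    print("All recovery workstations and core services check out.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```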
Simultaneously, our ICT and EOT teams assimilated all of the information available to us, and we reached out to public agencies we had relationships with to get news as it happened. We were fully prepared to recover in NJ or work from NY if the situation changed for the better at the last minute. We were on partial generator power in NY, but transportation was a factor to consider. At 5 am we made the final decision to enact our plans and recover our critical NY processes at our now-ready NJ campus. As I mentioned, we had practiced for this type of scenario many, many times, which enabled us to close many potential gaps prior to this event. Practice, practice, practice and a great team gave us confidence we would make this work.
We contacted teams at the recovery site and let them know that recovery personnel would be arriving by 8 am.
Process owners had communicated with employees throughout the evening and early morning hours. They utilized the up-to-date, well-tested call trees they kept at home. We knew everyone who lived near our locations in NJ from detailed mapping information in our plans. At the time, everyone had landlines, which had power from the phone company. Today we would have used additional multi-modal means to communicate – more on that in the mass communication post.
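A call tree is really just a branching notification structure in which each person notifies a small set of others. Here is a minimal sketch of that idea; the names and numbers are hypothetical and this is not our actual tree, just an illustration of how the branches fan out so no single person has to reach everyone.

```python
# Minimal sketch of a call-tree data structure with hypothetical contacts.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Contact:
    name: str
    phone: str
    calls: list[Contact] = field(default_factory=list)  # people this contact notifies

def activate(root: Contact, message: str) -> None:
    """Walk the tree and print the notification order (stand-in for real calls)."""
    print(f"{root.name} ({root.phone}): {message}")
    for person in root.calls:
        activate(person, message)

# Hypothetical branch of a trading-floor call tree
tree = Contact("Process Owner", "555-0100", [
    Contact("Team Lead A", "555-0101", [Contact("Trader 1", "555-0103"),
                                        Contact("Trader 2", "555-0104")]),
    Contact("Team Lead B", "555-0102", [Contact("Rep 1", "555-0105")]),
])
activate(tree, "Report to the NJ recovery site by 8 am.")
```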
Impressively, 100% of the trading and customer service recovery employees in New Jersey were contacted and made it to work by the opening bell that morning.
Telecommunications re-routed all customer-facing toll-free numbers (documented in our internal phone routing tables). So all calls that were destined for NY were now ready to be taken in NJ! We had tested this often.
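Conceptually, that routing documentation is just a mapping from each toll-free number to its primary and recovery destinations. Here is a minimal sketch of such a table; the numbers and site names are hypothetical, and the actual re-route is performed by the carrier and telecom team – this only models the documentation that tells them what goes where.

```python
# Minimal sketch of an internal phone-routing table and its failover view.
# All numbers and destinations are hypothetical.
ROUTING_TABLE = {
    "800-555-0199": {"primary": "NY-trading-desk", "recovery": "NJ-conf-room-A"},
    "800-555-0198": {"primary": "NY-cust-service", "recovery": "NJ-conf-room-B"},
}

def failover_destinations(table: dict) -> dict:
    """Return the number -> recovery destination mapping to hand to the telecom team."""
    return {number: routes["recovery"] for number, routes in table.items()}

for number, destination in failover_destinations(ROUTING_TABLE).items():
    print(f"Re-route {number} -> {destination}")
```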
I kept the conference line open throughout the event. It was our focal point for smoothing out any speed bumps we encountered during the day. For example, we needed to gather a few more headsets, replace a couple of dead batteries and get additional trading forms – all things we had available at the recovery site as part of our plan. Low-hanging fruit, because we kept extra supplies on hand. During a crisis, not everything works perfectly, so you have to be resilient and able to adapt. You should build and promote a culture of resilience. I promise, it will serve you well!
At the end of each day we did a recap/hotwash meeting (we used to call it a post-mortem, but eventually we shied away from that term as it sounded sort of negative) to see what went right and what could be improved for the next day and the next inevitable event. We documented all issues and opportunities for improvement in an issues log, then tracked and closed them one by one. You really do learn from every event. The key is to act on what you learn. Unfortunately, industry research shows we do not always act on those ‘lessons learned’ opportunities. That should not happen. Acting on them and fixing them for the next time really pays off.
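Here is a minimal sketch of what an issues log can look like; the entries are hypothetical, and the point is simply that every item gets a description, an owner and a status so it can be tracked to closure rather than forgotten after the event.

```python
# Minimal sketch of a hotwash issues log with hypothetical entries.
from dataclasses import dataclass

@dataclass
class Issue:
    description: str
    owner: str
    status: str = "open"   # open -> in progress -> closed

log = [
    Issue("Only 8 headsets staged; need 12", owner="Facilities"),
    Issue("Two cart PCs had dead batteries", owner="Desktop Support"),
    Issue("Trading forms ran low by mid-afternoon", owner="Operations"),
]

def open_items(issues: list) -> list:
    """Items still needing action before the next day (or the next event)."""
    return [issue for issue in issues if issue.status != "closed"]

for item in open_items(log):
    print(f"[{item.status}] {item.description} (owner: {item.owner})")
```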
On day two of the Great Blackout we again worked from NJ, and finally on day three we could begin returning to our primary production site in Manhattan. We did this in a couple of waves of employees rather than all at once, just in case the Manhattan power infrastructure was not stable. If you have ever been through one of these events, you probably can relate that power has a way of going up and down.
I am so proud that our company responded so professionally and effectively to such a difficult challenge. I attribute it to a culture of resilience, the dedication of management, professionalism of employees, communication, teamwork and testing. Everyone knew their role and performed it well. No egos – no glory mongers. It was a stressful week but also the best feeling as we were ‘zoned in’ as a team throughout the event.