Our Reach (Usually) Exceeds Our Grasp

In his insightful 1984 book, "Normal Accidents", Charles Perrow lays out how many modern complex and/or interconnected systems designed by humans fail in myriad ways due to causes that were either not anticipated or were dismissed as so improbable that the designers needn’t worry about them.

Engineers are trained to build redundancy into critical systems. Unfortunately, the rationale for dual or triple redundancy is often undermined by a hidden (faulty) assumption that the failure modes are independent. For example, the 1994 crash of a Boeing 737 was caused by a failure within a hydraulic power control unit (PCU) that was assumed to have internal redundancy. It was later determined, however, that the unit had a failure mode that made it a single point of failure. The same PCU may also have caused other 737 accidents and another temporary loss-of-control incident.
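The arithmetic behind this is worth making explicit. A minimal sketch (all probabilities here are invented for illustration) of why redundancy only pays off when the independence assumption holds:

```python
# Hypothetical numbers: why redundancy depends on independent failures.

p = 1e-3  # assumed failure probability of a single unit

# If two redundant units fail independently, the system fails only
# when both fail at once:
p_independent = p * p  # 1e-6 -- a thousandfold improvement

# Now model a common-cause failure (a shared hidden defect, like the
# 737 PCU's single point of failure) that disables both units together:
p_common_cause = 1e-4
p_correlated = p_common_cause + (1 - p_common_cause) * p * p

print(f"independent failures: {p_independent:.2e}")
print(f"with common cause:    {p_correlated:.2e}")
```

Even a common-cause probability far smaller than the per-unit failure rate dominates the system failure probability, wiping out most of the benefit the redundancy was supposed to buy.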

More recently, the meltdown at Fukushima appears to have resulted from a failure to anticipate a tsunami of the magnitude that actually occurred, combined with rather unwise placement of the backup generators. These failures were clearly not independent, and the risk of a large tsunami was also higher than anticipated. As far as I am concerned, Nassim Nicholas Taleb’s “The Black Swan” should be required reading at every engineering school - and every business school.

Unintended consequences are also frequently present in human-built systems. Think of the ‘flash crash’ of 2010. It appears that regulators either didn’t comprehend the risks associated with high-speed trading systems or were subject to ‘regulatory capture’. It’s not clear whether the people who designed these trading systems - and they are really smart people - understood the risks or chose to ignore them. The causes of the event are still debated, but for me the lesson is that when complex systems are themselves interconnected, the result is often incomprehensible. This is a classic case of “our reach exceeds our grasp”, and of a system that is fragile in the sense described in Taleb’s more recent book, “Antifragile: Things That Gain From Disorder”.

Bringing the discussion a bit closer to home, the recent data breach at Target follows a similar pattern. While not all the details are public, we believe we know that several failures contributed to the breach: a vendor’s credentials were compromised (failure 1), leading to privilege escalation (failure 2) that went undetected (failure 3), leading to network access to network-connected point-of-sale (POS) devices (failure 4), leading to compromise of the integrity of the POS devices (failure 5), leading to the installation of an exfiltration executable that was detected but ignored (failure 6), leading to the eventual exfiltration of credit card information (failure 7). If and when the final facts are made public, I would not be surprised if additional failures come to light.
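The chain above can be read as seven defensive layers that all had to fail. A toy sketch (stage names follow the account above; the probabilities are invented) of why stopping the chain at any single stage would have prevented the breach:

```python
# Hypothetical model: a breach of this kind requires every defensive
# layer to fail, so detection at any one stage breaks the whole chain.

stages = [
    ("vendor credentials compromised", 0.5),
    ("privilege escalation",           0.5),
    ("escalation undetected",          0.5),
    ("network reaches POS devices",    0.5),
    ("POS integrity compromised",      0.5),
    ("malware alert ignored",          0.5),
    ("data exfiltrated",               0.5),
]

# If the layers failed independently, the chance of a full breach
# would be the product of the per-stage failure probabilities:
p_breach = 1.0
for name, p_fail in stages:
    p_breach *= p_fail

print(f"chance all {len(stages)} layers fail: {p_breach:.4f}")
```

With seven 50/50 layers that product is under one percent, which is the comforting defense-in-depth story. A determined attacker, though, drives each per-stage probability toward 1, which is exactly why treating the layers as independent is misleading.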

Neither Perrow nor Taleb deals much with malicious human intent in interactions with complex systems. Whether motivated by financial gain or by a desire to cause terror and disruption, malicious actors can make supposedly independent failure modes dependent. This is what makes defense so difficult. Even though aircraft and nuclear power plants are designed by engineers with great expertise and attention to safety, mistakes are still made that cause catastrophic failures - even without malicious human intervention.

The IT infrastructure of most companies grows in a semi-planned fashion. Staff training is inconsistent, and it is trivial to interconnect - even unknowingly - systems that should never be connected. Shadow IT can make an organization more responsive to business needs, but at the possible cost of making the entire infrastructure more fragile. Latent defects that would never be discovered without malicious actors are exploited on a regular basis. IT organizations are too often judged only on the new functionality they deliver, not on whether the infrastructure is secure. It should be no surprise that we see so many publicly reported breaches.

All this brings me to an FAA document I came across a few days ago. It seems that Boeing wants to connect the internal (flight-critical) network in some 777 models to the passenger service network. Both Boeing and the FAA seem to understand the serious consequences of a security breach here. I have a lot of respect for both organizations and really hope they get this right. However, I’m not sure I would trust anyone to do this right...