
When Technology Goes Wrong

Written by Stuart Ball

How Bad It Can Get, and What You Can Do About It

It has been said that there is no such thing as an accident, and that most mishaps and disasters are actually predictable and preventable. Technology is only as good as its design—which should incorporate forethought about what could go wrong. In this article, Stuart looks at how lessons from past technology failures, from faulty indicators to flawed designs—including one he experienced firsthand—can prevent future problems and disasters.


  • How can engineers guard against technological errors?
  • What is failure mode effects analysis?
  • What are important considerations in protecting systems from catastrophic failure?

Topics: Sensors • Embedded Systems (general)

In 1973, there was a movie called Westworld, about a robotic amusement park. The movie trailer said, “Nothing can possibly go wrong. Go wrong. Go wrong…” Of course, the entire movie was about things going murderously wrong.

My article will be unusual, in that there are no schematics, and there is no companion code for it on the Circuit Cellar website—just a few illustrative firmware sketches inline. It is about what happens when technology goes wrong, and how we can prevent or deal with such situations in our designs. I’ll start with a short version of my own potentially deadly experience with technology that went wrong.

MY ACCIDENTAL CAMPING TRIP

I live in Colorado. In March of 2015, I went snowshoeing at an area called Brainard Lake, which is at about 10,000 feet altitude. I had planned to go to a different area early in the day, but it was inaccessible due to snow. So I went to Brainard in the afternoon. I set a GPS waypoint (a geographical position with specific latitude and longitude coordinates) at the spot where I parked the car.

I had not been to Brainard before, so I was planning a short familiarization hike, maybe an hour or so. The only other people I saw all afternoon were some cross-country skiers, and they were long gone by the time I got my snowshoes on. I’ve never had a good sense of direction, so when I decided I’d gone far enough, I wasn’t sure of the direction to the car. I should have backtracked on the trail and returned that way. But it was late, so I decided to bushwhack to the car in a straight line using the GPS. But the GPS waypoint was bad, and it led me in the wrong direction.

By the time I reached the waypoint and realized I was lost, the shadows were getting long, it was getting cold, and I had put back on the layers of clothing I had taken off. I considered hiking back, but the area I’d hiked was all trees and deep snow—difficult terrain, and no place to stop if I ran out of daylight. The place where I ended up had a tree with dry ground, surrounded by elk droppings. Presumably some elk had recently spent the night there and melted the snow. I decided it was safest to stay there.

It was nearly dark, and I spent the little time I had tearing down pine branches for insulation and collecting firewood. It got very dark—trees were just vague shapes. I built a fire but didn’t have enough wood to keep it burning through the night. (If you’ve never done it, you’d be surprised how much wood that takes.) Figure 1 is a picture of the fire as I was getting it started; you can see my water bottle packed with snow for melting. My gloves were wet, so I took them off. They froze solid overnight.

FIGURE 1
The fire I built during my “lost-in-the-mountains” experience.

I didn’t have cell phone signal, and I couldn’t even send a text message. (Safety tip: a text only needs a brief connection to transmit, and will often be sent even if the connection isn’t stable enough for a voice call.) I decided that since I was well away from where I should have been, searchers might not look there for a while (later confirmed by one of the rescue searchers). So, despite the usual advice to stay put, I decided to follow my tracks back in the morning. If snow covered my tracks during the night, I’d follow the bad GPS heading backward until I intersected the trail, and as a last resort, I’d follow my compass.

I started back about a half hour before daybreak with my headlamp. By the time I reached the more difficult terrain, there was enough sunlight to see. Searchers had a helicopter looking for me. I could see them, but they couldn’t see me, because I was in a wooded area.

I eventually encountered one of the dog search teams and we returned to the trailhead. One of the rescuers said they had done rescue searches in conditions less severe than mine and the outcome wasn’t good. The sheriff’s report said I became disoriented, found a dry place to spend the night, constructed a shelter, and built a fire. “Shelter” is a bit of a stretch; it sounds like I built a log cabin or something.

I know what I did that got me lost. I started out late, leaving me no time to recover if something went wrong; I got off-trail in an unfamiliar area; and I followed the GPS into difficult terrain. One of the rescue searchers, who was on cross-country skis, said she was impressed with some of the terrain I covered; I told my wife later that I had no intention of covering any terrain that would impress those people.

I admit that I had made some bad decisions. But if the GPS waypoint had been correct, none of my bad decisions would have mattered, because it would have taken me right back to my car. That GPS model isn’t supposed to set a waypoint if it doesn’t have a good fix on the GPS satellites. But this one time, it did, without reporting any error or warning.

I worked on GPS development back when it was an all-Department-of-Defense (DoD) program, so I have some knowledge of how it works. The GPS system is very reliable; what it does is technically complicated, but conceptually very simple. But the GPS receivers we use are not always as reliable. I contacted the manufacturer to see if there was a firmware update for my unit, but never received a response.

In my case, the GPS didn’t tell me (or didn’t know) that the waypoint was bad. Fortunately, even with my mistakes and the faulty GPS indication, it ended well enough. But there was another situation where a faulty indicator caused a nuclear meltdown.

SENSOR FAILURE: REACTOR MELTDOWN AT THREE MILE ISLAND

The Three Mile Island (TMI) nuclear reactor was completed in 1978. In 1979, it suffered a partial meltdown of one reactor (Figure 2). The Nuclear Regulatory Commission identified several causes of the incident, including inadequate operator training and incorrect procedures. There are a number of articles about TMI, including one on Wikipedia [1]. I won’t go into all the details; rather, I want to look at one small piece of the problem. One thing stood out to me that, like my wayward GPS, involved a faulty indication; fixing it could have mitigated the entire scenario.

FIGURE 2
Newsmen and spectators stand in front of the main gate of the Three Mile Island Nuclear Generating Station in Middletown, Penn., April 2, 1979. (Photo credit: Jack Kanthal/The Associated Press)

In the TMI reactor, a relief valve was stuck open, but a panel light indicated that it was closed. The light actually indicated only whether power was applied to the solenoid that controlled the valve, not the actual valve position. The operators took several actions in the first few hours of the accident based on their assumption that the valve was closed. Those actions made the problem worse, and likely turned an orderly shutdown into a disaster.

I’m not a nuclear engineer, and the plant was built in 1978, without a lot of our modern technology, so maybe it wasn’t feasible to detect the actual valve position. But like my GPS that was sending me the wrong way without warning, there was no way for the operators to know that the valve was actually stuck open.

Like my accidental camping trip, a lot of things went wrong at Three Mile Island—mostly bad decisions—but the effect was multiplied by the bad information about the position of that one valve. After reading articles about that incident, it appears to me that if the valve light had reported the actual position of the valve, the operators would have taken the correct actions instead of the wrong ones. There would still have been a reactor accident and a shutdown, but probably not a partial meltdown.

Failure of Sensors in Embedded Systems: In an embedded system, a small thing can sometimes lead to big issues. The Boeing 737 MAX experienced two fatal crashes because a sensor sent faulty information to the flight control system, indicating that the plane’s nose was too high. In the two crashes, 346 people died. The plane was grounded until the problem was solved, and the entire incident cost Boeing billions of dollars.

How could such failures be prevented? In designing an embedded system, ask yourself what happens if a sensor goes bad. Can you detect it? Depending on the design, an open or shorted sensor might be easy to detect. But a faulty sensor might not. Do you have a way to detect that a sensor is sending bad or noisy data? Can you detect an out-of-range condition? And if you can’t detect it, what is the worst that can happen?

Answering that last question may require a detailed analysis, or even simulating a faulty part. I used to keep a flaky cable in my desk just for that kind of testing. You may not be working on anything that can cause deaths, but something like an industrial robot can do a lot of damage if things go really wrong. Even something like the temperature control in a soldering iron can do damage if it runs wide open.
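
To make those questions concrete, here is a minimal sketch in C of the kind of plausibility checks firmware might apply to each raw sensor reading. The names and thresholds are hypothetical, not from any particular design; the point is simply that "out of range," "impossible jump," and "stuck at one value" are all detectable conditions, if you decide in advance to look for them.

/* Minimal sketch of per-reading sensor plausibility checks.
 * All names and thresholds are hypothetical; tune them from the
 * sensor datasheet and the physics of what is being measured. */
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

#define ADC_MIN_PLAUSIBLE    5u     /* near zero suggests a short or open wire   */
#define ADC_MAX_PLAUSIBLE    250u   /* near full scale suggests saturation       */
#define MAX_STEP_PER_SAMPLE  20     /* larger jumps are physically implausible   */
#define MAX_STUCK_SAMPLES    100u   /* identical readings this long = frozen     */

typedef enum { SENSOR_OK, SENSOR_OUT_OF_RANGE, SENSOR_JUMP, SENSOR_STUCK } sensor_status_t;

sensor_status_t check_sensor(uint8_t raw)
{
    static uint8_t  last = 0;
    static uint16_t stuck_count = 0;
    static bool     first = true;

    sensor_status_t status = SENSOR_OK;

    if (raw < ADC_MIN_PLAUSIBLE || raw > ADC_MAX_PLAUSIBLE) {
        status = SENSOR_OUT_OF_RANGE;            /* open, short, or saturated      */
    } else if (!first && abs((int)raw - (int)last) > MAX_STEP_PER_SAMPLE) {
        status = SENSOR_JUMP;                    /* noise or an intermittent wire  */
    } else if (raw == last && ++stuck_count > MAX_STUCK_SAMPLES) {
        status = SENSOR_STUCK;                   /* reading frozen: suspect sensor */
    }

    if (raw != last) {
        stuck_count = 0;
    }
    last  = raw;
    first = false;
    return status;
}

A real design would tune those limits from the datasheet and decide, for each status, whether to retry, fall back, shut down, or alert an operator.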

It’s easy to fall into the trap of designing things to work when nothing is broken, but how badly can things go sideways if a sensor or something else breaks? Will the system catch fire if an essential cooling fan fails? If so, you should probably have both a fan speed monitor and a temperature sensor. What if you have a stepper motor or solenoid, like the valve at Three Mile Island, that your software controls but has no way to check? How bad can things get if the solenoid coil burns open, or the solenoid gets stuck and can’t move? Can two parts smash into each other if the stepper position isn’t where the microcontroller (MCU) tried to send it? Can a cascade failure occur like the one that happened at Three Mile Island? If your stepper can actually break something, it probably needs an encoder to verify position. Do you need a way to verify that a solenoid is in the right position, or that a heater is on when it should be—or off when it shouldn’t?
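
As an illustration of that last point, here is a hypothetical sketch of commanding a stepper move and then verifying it against an independent encoder before trusting the position. The function names are placeholders for whatever your platform provides, not a real driver API.

/* Hypothetical sketch: command a stepper move, then verify the result
 * against an independent encoder before trusting it. The extern functions
 * are placeholders for platform-specific drivers, not a real API. */
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

#define POSITION_TOLERANCE_STEPS 2L

extern void    stepper_move_to(int32_t target_steps);   /* command the move (blocking)       */
extern int32_t encoder_read_steps(void);                /* independent position feedback     */
extern void    enter_safe_state(void);                  /* de-energize outputs, raise alarm  */

bool move_and_verify(int32_t target_steps)
{
    stepper_move_to(target_steps);

    int32_t actual = encoder_read_steps();
    if (labs((long)actual - (long)target_steps) > POSITION_TOLERANCE_STEPS) {
        /* The mechanism is not where we told it to go: lost steps, a jam,
         * or a broken coupling. Do not proceed on the assumption that it is. */
        enter_safe_state();
        return false;
    }
    return true;
}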

Ideally, sensors should be designed so that they never saturate in normal operation. For example, in a piece of equipment designed to operate up to 50°C that uses an 8-bit analog-to-digital converter (ADC) to measure temperature, don’t make 0xFF equivalent to 50°C. Instead, scale the ADC input so 0xFF is the value for, say, 70°C. That way, if you ever see 0xFF, you know two things: the temperature is way out of range, and the sensor reading can’t be trusted, because it’s saturated. It might be 70°C or it might be 150°C; you just don’t know. But you know something needs to be done, whether it’s a system shutdown or notifying a human operator. And if that temperature sensor ever does read 0xFF, don’t display 70°C to the operator; display “Overtemp” or something similar. Because, again, you can’t know the actual value of a saturated sensor.
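
Here is a small sketch of that idea in C, assuming a hypothetical 8-bit ADC whose full-scale reading has been deliberately mapped to 70°C rather than 50°C. The names and values are illustrative only.

/* Sketch of the scaling described above: an 8-bit ADC whose full-scale
 * reading (0xFF) is deliberately mapped to 70 degrees C, well above the
 * 50 degrees C operating limit. Names and values are illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define ADC_FULL_SCALE      0xFFu
#define TEMP_AT_FULL_SCALE  70.0f   /* degrees C represented by a reading of 0xFF    */
#define TEMP_OPERATING_MAX  50.0f   /* highest temperature the equipment should see  */

void report_temperature(uint8_t adc_raw)
{
    if (adc_raw == ADC_FULL_SCALE) {
        /* Saturated: the real temperature might be 70 or 150 degrees C.
         * Never display a number we cannot actually know. */
        printf("OVERTEMP - sensor saturated\n");
        /* ...trigger a shutdown or operator alert here... */
        return;
    }

    float temp_c = ((float)adc_raw * TEMP_AT_FULL_SCALE) / (float)ADC_FULL_SCALE;
    if (temp_c > TEMP_OPERATING_MAX) {
        printf("OVERTEMP: %.1f degrees C\n", temp_c);   /* out of spec, but still measurable */
    } else {
        printf("Temperature: %.1f degrees C\n", temp_c);
    }
}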

If using a pressure sensor, can you detect an overpressure failure? And what can break if the pressure can’t be measured?

In some cases, where there are potential safety issues or equipment damage, a redundant sensor is a good solution. Yes, that adds cost, but the cost of a single field failure may exceed the additional product cost many times over.

Think of the Boeing 737 MAX as an example. The plane actually had a redundant sensor, but didn’t use it. I don’t know the technical challenges involved in integrating the second sensor, and there was reportedly a desire to minimize the need for additional pilot training for the new version, but it’s hard to imagine that the cost would exceed the human and monetary cost of those two crashes. Even if you can’t tell which sensor is working, getting wildly different results from redundant sensors is a signal that something is wrong and human intervention is needed.
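
Even a simple cross-check buys a lot. Here is a hypothetical sketch of comparing two redundant sensors: if they disagree by more than anything seen in normal operation, the reading is flagged as untrusted rather than silently averaged. The threshold and names are invented for illustration.

/* Hypothetical cross-check of two redundant sensors. If they disagree by
 * more than anything seen in normal operation, we can't tell which one is
 * right -- but we do know the result cannot be trusted. */
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_DISAGREEMENT 5   /* largest difference expected between healthy sensors */

typedef struct {
    int16_t value;     /* agreed reading; valid only when trusted is true */
    bool    trusted;
} reading_t;

reading_t read_redundant(int16_t sensor_a, int16_t sensor_b)
{
    reading_t r = { 0, false };

    if (abs((int)sensor_a - (int)sensor_b) <= MAX_DISAGREEMENT) {
        r.value   = (int16_t)(((int)sensor_a + (int)sensor_b) / 2);
        r.trusted = true;
    }
    /* If not trusted, the caller must fall back, shut down, or ask a human. */
    return r;
}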

A lot of industries use some version of failure mode effects analysis (FMEA) [2], which attempts to determine what happens when things fail, and the potential severity of the failure. Even with systems that are not safety-related, thinking about these situations and handling them in your design will make it more robust. A reputation for poor reliability may not kill anyone, but it can kill a company.

FAULTY SOFTWARE AND DESIGN: RADIATION THERAPY OVERDOSING BY THERAC-25

The Therac-25 was a radiation therapy machine used for treating cancer. Produced in 1982, it was controlled by a PDP-11 minicomputer. The Therac-25 had two modes of operation: one in which a low-power beam was used to target a specific location on the body, and another in which a much higher power was used with an attenuator to produce x-rays (I’m simplifying this a lot).

Over the course of two years, the Therac-25 was involved in at least six incidents in which patients were overdosed with radiation, causing at least three deaths. There are numerous articles about the Therac-25, both on the Internet and in print publications, so I won’t rehash the details here. They are summarized in a Wikipedia article [3].

In the Therac-25, the operator would select the treatment mode, and the PDP-11 would set the beam power and rotate a turntable to put the attenuator between the emitter and the patient. The problem occurred when an operator would start the setup for one mode, then switch to the other mode. The machine took several seconds to set up the hardware, and the computer would leave the machine partially configured when the mode was switched in mid-setup. The high-power beam would then be applied to the patient without the attenuator, causing severe radiation burns.

Numerous faults were found in the software and in the design process. I want to look at just two of the design flaws and their applicability to our embedded systems.

Problem 1—Software Design and Operator Error: When asked why it’s hard to design a bear-proof garbage can, a Yosemite park ranger supposedly said that there is considerable overlap in intelligence between the smartest bears and the dumbest tourists. It’s not that bears are smarter than people, but if someone can use a bear-proof garbage can incorrectly, someone eventually will.

The Therac-25 engineers attempted to reproduce the failure, but at first they couldn’t, because the problem was actually caused by quick operators combined with bad software. Operators of equipment like that use it all day, every day, and they have a lot of opportunity to get very fast at routine operations. What happens to your system if the operator performs some unexpected operation? Does it recover, or is it in a bad state? The Therac-25 race condition was caused by a fast operator. But it wasn’t that the computer was too slow; it was that the machine setup, being motor driven, took seconds to complete. The software design assumed the setup would be complete before the operator could change modes. It was a genuine race condition, but the error window was several seconds wide.

In your system, it might not be a human operator; it could be an input from some external device, such as a speed sensor, USB interrupt, or some other event. Can an unexpected combination of events put your system into an unknown state? And can you detect it when that happens? What about an intermittent connection that produces runaway pulses much faster than the design can handle?
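
One general defense against this class of problem is to act only on the verified hardware state, never on the requested one, and to reject new requests while the hardware is still moving. The sketch below is a hypothetical illustration of that idea in C; the setup and readback functions stand in for whatever your hardware provides.

/* Sketch of one defense against a Therac-25-style race: act only on the
 * verified hardware state, never on the requested one, and refuse new
 * requests while the hardware is still moving. Names are hypothetical and
 * the extern functions are assumed platform stubs. */
#include <stdbool.h>

typedef enum { MODE_LOW_POWER, MODE_HIGH_POWER } beam_mode_t;
typedef enum { SETUP_IDLE, SETUP_IN_PROGRESS, SETUP_VERIFIED } setup_state_t;

static setup_state_t setup_state = SETUP_IDLE;
static beam_mode_t   requested_mode;

extern void start_hardware_setup(beam_mode_t m);   /* begins the slow mechanical setup     */
extern bool hardware_matches_mode(beam_mode_t m);  /* independent position/power readback  */

bool request_mode(beam_mode_t m)
{
    if (setup_state == SETUP_IN_PROGRESS) {
        return false;            /* reject operator input until the current setup finishes */
    }
    requested_mode = m;
    setup_state    = SETUP_IN_PROGRESS;
    start_hardware_setup(m);
    return true;
}

void setup_poll(void)            /* called periodically from the main loop or a timer */
{
    if (setup_state == SETUP_IN_PROGRESS && hardware_matches_mode(requested_mode)) {
        setup_state = SETUP_VERIFIED;
    }
}

bool ok_to_enable_output(void)
{
    /* Enable the output only when the measured state agrees with the request. */
    return (setup_state == SETUP_VERIFIED) && hardware_matches_mode(requested_mode);
}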

Problem 2—Inappropriate Code Reuse: The Therac-25 used code from earlier versions of the product. But the earlier versions had fuses and interlocks to prevent the machine from applying high power without the attenuator. The developers of the Therac-25 used the old code, but removed the hardware interlocks, trusting their software skills to prevent dangerous conditions.

In an old song, “Signs,” the singer reacted to a “No trespassing” sign by shouting at a farmhouse that they didn’t have the right to put up a fence to keep him out. There is an old saying that you should never tear down a fence until you understand why it was put up. That fence might be there not to keep you out, but to keep in an aggressive bull. The Therac-25 developers removed the interlocks that prevented exactly the kind of dangerous conditions that caused the incidents. The software bugs had existed in the earlier machine, but the hardware interlocks prevented harmful incidents.

Code reuse isn’t a bad thing. But when reusing code, especially on new hardware, it is important to analyze what can go wrong in the new system that didn’t apply to the old one. I can’t read the minds of the developers of the Therac-25, but I am confident that if there were interlocks to prevent beam activation in the dangerous configuration, those patients wouldn’t have been killed. The developers of the Therac-25 had confidence in their software skills and in the applicability of the reused code to the new hardware—overconfidence, as it turned out.

Some solenoids, heaters, and other parts aren’t designed for continuous operation and will burn out if powered continuously; such a device controlled by software can fail if left energized too long. If your software can do actual damage to the hardware or to the user, how well protected is it against such conditions, and how confident are you that the protections hold up against a “dumb tourist” who will try things you didn’t think of?
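
As a hypothetical example of that kind of protection, the sketch below enforces a maximum on-time and a minimum cool-down for a solenoid in firmware, regardless of what the higher-level logic requests. The timing values, millis() tick, and driver function are invented for illustration.

/* Hypothetical guard for a solenoid that is not rated for continuous duty:
 * enforce a maximum on-time and a minimum cool-down in firmware, regardless
 * of what the higher-level logic asks for. Timing values are invented. */
#include <stdint.h>
#include <stdbool.h>

#define SOLENOID_MAX_ON_MS   2000u   /* assumed rating: 2 s maximum energized time */
#define SOLENOID_MIN_OFF_MS  8000u   /* assumed cool-down between actuations       */

extern uint32_t millis(void);               /* free-running millisecond tick (assumed) */
extern void     solenoid_drive(bool on);    /* low-level hardware driver (assumed)     */

static uint32_t on_since_ms, off_since_ms;
static bool     energized;

void solenoid_request(bool want_on)
{
    uint32_t now = millis();

    if (want_on && !energized && (now - off_since_ms) >= SOLENOID_MIN_OFF_MS) {
        energized   = true;
        on_since_ms = now;
        solenoid_drive(true);
    } else if (!want_on && energized) {
        energized    = false;
        off_since_ms = now;
        solenoid_drive(false);
    }
}

void solenoid_watchdog(void)                /* call from a periodic tick */
{
    if (energized && (millis() - on_since_ms) >= SOLENOID_MAX_ON_MS) {
        energized    = false;               /* force it off even if still requested */
        off_since_ms = millis();
        solenoid_drive(false);
    }
}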

FAULTY DESIGN: WALKWAY COLLAPSE AT THE HYATT REGENCY HOTEL

In 1981, two overhead walkways in the Kansas City Hyatt Regency Hotel collapsed, killing 114 people and injuring more than 200. As with the other catastrophes mentioned here, there were extensive investigations into the causes, which are summarized in Wikipedia [4]. I want to focus on just one thing that went wrong, and how it is applicable to our designs.

The two overhead walkways were tied to ceiling supports, and then held up with hanger rods that went through the walkway beams. During construction, a request was made to change the design to make fabrication easier.

In the original design (Figure 3, left), a single support rod passes through the upper walkway beam and is secured there by a nut. The rod continues down through the lower beam, where it is secured by another nut. Thus, the rod carries the weight of both walkways, but each nut carries the weight of only one walkway.

In the modified design (Figure 3, right), there are two hanger rods. One rod stops at the bottom of the upper beam and another rod connects the upper and lower beams. As a result, the nut on the upper beam, and the beam itself, have to support the weight of both walkways instead of just one. It was not designed to do that.

FIGURE 3
Construction design change at the Kansas City Hyatt Regency Hotel. This change led to the collapse of two overhead walkways, killing 114 people and injuring more than 200.

The design change request was not unreasonable from the viewpoint of the company making those support rods. In a complex building like that, there were probably many design changes during construction. But none of the others were fatal.

In the case of the Hyatt disaster, the problem was that nobody recalculated the stress analysis for the design change; possibly the original designer didn’t even know the change request had been made.

This goes back to the adage about tearing down a fence before you understand its purpose. Apparently nobody in the decision chain for that change even recognized the problem or tried to check. In a building like that Hyatt, there are thousands of bolts, tons of concrete that must be poured to specification, miles of electrical wiring, and hundreds of electrical fixtures. After an incident investigation, when the cause of the failure has been found, it’s easy to say someone should have checked; hindsight always has 20/20 vision. But every design change can’t result in a complete reanalysis of the entire building—how would someone know to check the stress on that specific nut?

If your embedded system hardware or software has a safety-critical element, how does anybody know it’s important? You could add a comment to the code: “Safety issue—do not change without analyzing ‘xyz’.” But is that adequate?

In a similar way, how are safety features or limitations that prevent device damage retained in your code or in the hardware documentation? As I mentioned, some solenoids will burn out if current is applied too long. Does your code have fail-safes to prevent these events? How do you keep someone from changing them? Someone changing the code years after original development may not even be aware of the duty cycle limitation on the solenoid.
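
There is no perfect answer, but one partial approach is to make the constraint loud in the code and, where possible, enforce it at compile time. The sketch below uses an invented solenoid rating and a hypothetical hazard-analysis document number purely for illustration.

/* One way to make a safety limit harder to change casually: make it loud
 * in the code and back it up with a compile-time check. The solenoid
 * rating and the hazard-analysis document number are invented examples. */

/***************************************************************************
 * SAFETY-CRITICAL: This solenoid is rated for intermittent duty only.
 * Energizing it for more than SOLENOID_COIL_RATING_MS can burn out the
 * coil and was identified as a fire risk in the (hypothetical) hazard
 * analysis HA-123. Do NOT change these values without repeating that
 * analysis and getting the change reviewed.
 ***************************************************************************/
#define SOLENOID_COIL_RATING_MS  2000u   /* from the (assumed) datasheet       */
#define SOLENOID_MAX_ON_MS       1500u   /* firmware limit, with safety margin */

/* Compile-time guard (C11): if someone raises the firmware limit past the
 * coil rating, the build fails and forces a conversation. */
_Static_assert(SOLENOID_MAX_ON_MS < SOLENOID_COIL_RATING_MS,
               "Firmware on-time limit exceeds solenoid coil rating - see HA-123");

A comment alone can be ignored; a failed build cannot. It still isn't a complete answer, which is why reviews and change-control processes matter too.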

My Experiences with Faulty Design Changes: I worked on a project years ago that involved a rotating drum. Someone changed the drum material, presumably to reduce cost. But the original drum was made of conductive plastic, whereas the new one was ordinary plastic. The result was an unintentional Van de Graaff generator that could produce arcs up to a quarter inch long—right into the control PCB. Nobody was hurt, but the hardware sure didn’t like it. How did that person make that decision?

I worked on another system where guys in another group, in another country, had difficulty getting a part. Their vendor recommended a replacement, and they accepted it without asking me, though I had designed the PCBA and firmware. The boards seemed to work, but we got a lot of failures in the field. Nobody was hurt, but a bunch of boards had to be replaced in the field, and erroneous results from those boards cost the company money. The alternate part worked well enough to pass board-level test, but didn’t work over the full range of conditions encountered in the field.

When I looked into it, it was obvious what had happened. The replacement part, a current-sensing resistor, was the right size and type, but the wrong value. I had put notes on the schematic that included the calculation for the part, and those notes should have made it obvious not to change the value. I’m guessing that the engineer who approved the change didn’t even look at the schematic.

Thus, in an exact parallel to the Hyatt disaster, an engineer trusted the recommendation of the vendor without checking with the designer (me) or the documentation. The Hyatt disaster happened essentially because an obscure but potentially hazardous part of the design was changed with no review or accountability.

Could something like that happen in your designs? How does anyone in your organization know to evaluate such a change, and who has the authority to require that such an analysis be performed?

CONCLUSION

There have been many incidents and disasters that we can learn from. I’ve only looked at a few here, and these are relatively notorious in the engineering world. Many safety lessons have been gleaned from those occurrences, and standards have been created to accommodate those lessons. Hazard and operability (HAZOP) analysis has been around since the 1960s, and use of FMEA started in the 1940s with the U.S. military. Yet, the Boeing 737 MAX crashes happened despite all the lessons learned from earlier disasters, and with all those new processes and standards in place.

No matter how many lessons we learn and standards we create, humans are a critical link in the development process. With the complexity of modern systems, knowing how one failure affects the entire system isn’t always obvious. The important point is to make sure you understand how things can go wrong and how badly they can go wrong, and to design with that in mind.

Hopefully, this has given you some ideas of what to look for in your own designs—so you can avoid having your or your company’s name in the press after some high-profile disaster happens! Because things do go wrong. Go wrong. Go wrong… 

REFERENCES
[1] Wikipedia article about the Three Mile Island nuclear reactor disaster: https://en.wikipedia.org/wiki/Three_Mile_Island_accident
[2] American Society of Quality FMEA overview: https://asq.org/quality-resources/fmea
[3] Wikipedia article about the Therac-25 radiation therapy machine (overexposure incidents): https://en.wikipedia.org/wiki/Therac-25
[4] Wikipedia article about the Hyatt Regency Hotel’s overhead walkway collapse: https://en.wikipedia.org/wiki/Hyatt_Regency_walkway_collapse

PUBLISHED IN CIRCUIT CELLAR MAGAZINE • OCTOBER 2023 #399

Stuart Ball recently retired from a 40+ year career as an electrical engineer and engineering manager.  His most recent position was as a Principal Engineer at Seagate Technologies.
