Flawed Assumptions in Finding Bugs
This month, I move from strategies for duplicating bugs to strategies for finding bugs. But before we discuss specific strategies, we must ferret out many of the flawed assumptions that will bite us in implementing these strategies.
Bugs can easily creep into our software based on wrong assumptions. How many degrees does the Earth rotate every 24 hours? 360°! If that’s your assumption, then you can join the software engineers who caused the 1965 Gemini 5 mission to fall 80 miles short of its intended splashdown (Figure 1). The correct number is 360.98°, because in one solar day the Earth must also make up the roughly one degree it travels along its orbit.
And how many times have we made assumptions about the real-time load on our software, only to find that the real world often wreaks havoc with our load assumptions? It did with the 1997 Mars Pathfinder. In pre-flight tests, the engineers noticed a condition of priority inversion causing system resets. But they assumed that the real world would never present such loads to the RTOS. Whoops! It almost scuttled the mission once it was on Mars. Thank God for remote software updates.
And how about the F-22 Raptors (Figure 2) that suffered multiple computer crashes every time they crossed the 180th meridian of longitude? Did someone make a wrong assumption there?
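One classic way such a crash arises is naive longitude arithmetic. This little sketch (the function and its name are mine, not taken from the F-22 software) shows the difference between wrap-safe and naive differencing of two longitudes:

```c
#include <assert.h>

/* Hypothetical illustration: the difference between two longitudes
 * must be wrapped into [-180, +180). Naive subtraction jumps by
 * nearly 360 degrees the moment a track crosses the 180th meridian,
 * which is exactly the kind of discontinuity that breaks code
 * tested only in one hemisphere. */
double delta_longitude(double from_deg, double to_deg) {
    double d = to_deg - from_deg;   /* naive difference */
    while (d >= 180.0)  d -= 360.0; /* wrap into range  */
    while (d < -180.0)  d += 360.0;
    return d;
}
```

Crossing from 179°E to 179°W is a 2° move, but the naive difference is -358°.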
Of course, there are many faulty assumptions that can affect our designs. But in this article, we’ll focus on some faulty assumptions that hinder us from finding bugs in embedded systems. A difficult-to-find bug surfaces, and we spend hours or even days looking for it. Sometimes, we can’t find it because we’ve made assumptions about where it can’t be. Let’s see what some of these assumptions are.
JUMPING TO EARLY CONCLUSIONS
Once, when I was wet behind the ears, the company I worked for sent me to the field to fix an intermittent bug that shut the machine down about once a week. After several days of futile investigation, I decided to do a visual inspection of the wire-wrap build. Back in those days, backplanes were wire-wrapped (Figure 3). Since the wrapping was done by an automated machine, and we had many other machines in the field, it seemed unlikely to be the problem. But after much intense effort, I found an off-by-one wiring error involving a suppression capacitor. Elated, I fixed the wire wrap and restarted the machine.
Flying home, I felt like a hero. Who else could have found that miswiring in the maze of wire wrap? Who else would have stared for hours at the wire-wrapped backplane? A week later, the problem repeated itself. I was devastated. I was sure that I had found the problem.
Re-examining the design, I worked out what the circuit looked like when miswired. The error placed a suppression capacitor, with no limiting resistor, on an output that wasn't supposed to have one. The miswiring itself was not the direct cause of the failure; instead, it gradually weakened the totem-pole TTL output, which is why the machine ran for several months before the trouble started. Once weakened, the output became susceptible to noise, producing the occasional shutdowns. Correcting the wiring did not repair the damaged TTL output.
Flawed Assumption #1: “The manufacturing process is automated and repeatable, therefore it must be flawless.” When debugging, we can get blinded by the thought, It cannot be that.
Flawed Assumption #2: “The first discovered problem is the core problem.” When debugging, we need to make sure that the first problem we find is the real source of the bug.
THE MANUFACTURING PROCESS
One time, we had a system that was failing about once per month. We had scores of these systems in the field which never manifested the problem. We eventually tried replacing every board in the system, but to no avail. We went so far as to replace the entire system with a new one from the factory. Yet the same problem happened. It wasn’t until we went back and examined the manufacturing process that we discovered that the OEM (Digital Equipment Corporation) no longer shipped their PDP-11 computer interface units with the bus grant jumper card in them—but sent the card in a separate box with a wire in it. Our manufacturing folk did not know what to do with the extra wire, so they inserted the board and threw the wire away. Previously the board was inserted with the jumper installed.
Here is some background. Although there were four priority interrupt lines in the PDP-11, there was only one bus grant line. Once an interrupt was requested on one of the four interrupt lines, the CPU needed to issue a bus grant signal to the card for it to take over the bus. The single bus grant line was daisy-chained in the PDP-11 bus: the grant signal from the CPU had to pass through each board in turn until it reached the requesting card. If a backplane slot was not used, a bus grant jumper card needed to be inserted in the empty slot to keep the chain unbroken.
But wait, didn’t we replace the entire system with a new one? When we replaced the system, the technicians saw a PCB with no wires on it, assumed it did not need to be replaced, and put the original jumper board without the jumper wire into the new system.
Flawed Assumption #3: “The manufacturing process has not changed.” Verify that it has not changed.
LIBRARY AND TOOLCHAIN VERSIONS
In modern embedded systems development, we use a vast number of libraries and extremely complicated toolchains. They're powerful and robust, but they're not perfect.
Once we used an operating system that ran on top of DOS. It provided some pretty good multitasking that was useful for our design of a take-out mechanism. After hundreds of hours of operation, the system would occasionally miss taking one bottle out of the machine, resulting in a pile-up of all subsequent molten bottles. What a mess. We spent days trying to find what was wrong with our software. The problem proved to be in the library.
Flawed Assumption #4: “The libraries and toolchains have not changed and are perfect.” Verify that they have not changed, and don’t assume that they are perfect.
I knew a pretty gifted embedded software designer. He was smart and a quick study. He also had a lot of hubris about his own work. He often looked everywhere but in his own code for a problem. Gerald Weinberg’s groundbreaking classic, The Psychology of Computer Programming, was an eye-opener for this designer. Weinberg’s concept of egoless programming was both enlightening and challenging to him. But it shaped my career more than any other book because that designer was me.
Flawed Assumption #5: “The bug cannot be in my code.” In his famous work Thinking, Fast and Slow (Figure 4), Daniel Kahneman explains how cognitive bias can influence our decisions. I discussed these before in my article “Estimating Your Embedded Systems Project (Part 3)” (Circuit Cellar 297, April 2015)  to show how bias can affect our software development estimates. As part of becoming proficient at debugging embedded software systems, I would encourage you to read that book carefully to uncover some of the biases you bring to your debugging.
I had a designer work for me for many years who never trusted his own code. It was difficult for him to release his code to test, and difficult for him to think that the bug was any place other than his own code. This bias can cause as much trouble as programmer hubris. Most bugs are equal-opportunity!
Flawed Assumption #6: “It must be in my code.” This assumption can frequently get you in trouble. Let that be a reason for some self-confidence.
Many years ago, my business partner and I were tracking down an elusive bug (that turned out to be a 16-bit to 32-bit porting issue). But like hurricanes in Hertford, Hereford, and Hampshire, this bug “hardly hever [sic] happened.” The company patent attorney was a big supporter of ours (one of our patents won the company millions!), and he would come down every day to ask how we were doing. I must admit that after a week or so, we were feeling discouraged. We weren’t getting anywhere. Our inner Eeyores kept thinking, Days, weeks, months—who knows? Or even, as Eeyore famously espouses: It’s all for naught.
One morning, Spencer, the attorney, came down to the lab and announced that he had a dream the previous night that we had found the bug. Don’t ask me how it happened, but that day we found it. I think a big part of it was turning the Eeyore off in our mind, and knowing that the bug really could be found. We thought, Why not today?
Flawed Assumption #7: “I’ll never find the bug, and this is hopeless.” Although software designers often have a positive view of their own designs, sometimes after weeks of effort with nothing to show for it, we can get discouraged. This will definitely affect your ability to find the bug, and will thus become a self-fulfilling prophecy.
THE UNDERLYING DESIGN
There is a great story in The Embedded Muse #464 about a designer who thought he knew the underlying design of some RAM in an ASIC and used a standard algorithm for testing it. When an unusual number of errors started showing up in the RAM of this expensive ASIC, the team was forced to re-examine the assumptions the test algorithm made about the underlying design. Those assumptions were wrong. A standard RAM test marches ones and zeroes through memory, but to catch adjacent-cell failures it assumes that logically adjacent bits are physically adjacent. This ASIC had all of the odd bits in one RAM and all of the even bits in another.
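To see why the assumption matters, here is a minimal sketch of a classic checkerboard RAM test (my own illustration, not the algorithm from the Muse story):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Illustrative checkerboard test: write alternating 0x5555/0xAAAA
 * words so that, in a conventional RAM layout, every cell's physical
 * neighbors hold the opposite value, exposing adjacent-cell coupling
 * faults. If the ASIC instead routes all even bits to one physical
 * RAM and all odd bits to another, every cell within each physical
 * RAM receives the SAME value, and coupling faults go undetected. */
int checkerboard_test(volatile uint16_t *ram, size_t words) {
    for (size_t i = 0; i < words; i++)
        ram[i] = (i & 1) ? 0xAAAA : 0x5555;
    for (size_t i = 0; i < words; i++)
        if (ram[i] != ((i & 1) ? 0xAAAA : 0x5555))
            return -1;  /* fault detected */
    return 0;           /* test passed */
}
```

The code is fine; the flaw is the unstated assumption about how the bits are physically arranged.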
Flawed Assumption #8: “I know the underlying design of the hardware/software I’m using.” Learn to recognize and document when your algorithms depend upon an underlying architecture.
NOT DESIGNED HERE
Most of the time, we work with highly complex hardware (with 3,000-page datasheets) and voluminous, complex software packages. Then along comes a bug so mysterious, in a system whose code we know like the back of our hand, that we assume (perhaps subconsciously) it must be in the hardware or software package we didn’t design.
The result resembles the programmer-hubris assumption I outlined before, but here the problem isn’t hubris; it’s the sheer complexity of the designs we work with. We once designed a product that could be used to ferret out power-up issues. It proved highly successful in finding bugs for other companies. According to an engineer at Apple, it saved them millions of dollars by finding a bug right before they released a board for production. About three months after we released the product, we started getting bug reports. The bugs manifested themselves during power-up. We had never tested our product on itself! We had assumed that any problems must be in someone else’s hardware or software.
Flawed Assumption #9: “The bug must be in the hardware we didn’t design.”
Flawed Assumption #10: “The bug must be in the software we didn’t design.”
THERE ARE MILLIONS OF THESE IN PRODUCTION
Sometimes we assume that, since there are millions of these chips in production, the problem must be in our own software or hardware. We once had a system that we could not get through rigorous military EMI testing. No matter what we did in the design, we could not eliminate the problem, and we were convinced it must be somewhere in our hardware or software. In the end, the vendor (Intel) admitted that the fault was in their chip, of which they had produced millions.
We had a similar problem with an Atmel design that we struggled to get through EMI. After months of debugging, we finally got it to squeak by. But the design was so clean by then that we concluded the ARM9 chip itself was the problem. A similar design with a different chip proved that our design was not the source of the EMI; it was the chip itself, of which they had produced millions.
Flawed Assumption #11: “Since there are millions of these chips in production, it cannot be the problem.” Big manufacturers make mistakes, too.
It is truly amazing the extent to which assumptions can blind us. In this article I have covered just a small selection of the many that have bitten us before. Doubtless you have some of your own that you would like to share with the Circuit Cellar audience. If so, email me and I will try to get them into a future installment.
We are on the home stretch. Next time I’ll wrap up this series and look at specific strategies that I’ve found helpful in debugging embedded systems. But, of course, only in thin slices.
 “Estimating Your Embedded Systems Project (Part 3)” (Circuit Cellar 297, April 2015)
The Psychology of Computer Programming by Gerald M. Weinberg – a wonderful treatise on how the way we think affects the way we program.
Thinking, Fast and Slow by Daniel Kahneman – a great book containing many of the surprising biases that affect our daily life.
Eeyore witticisms https://news.disney.com/12-amazing-witticisms-from-eeyore A great place to see if there is an Eeyore in the mirror.
Embedded Muse 464 http://www.ganssle.com/tem/tem464.html Check out the details of the ASIC RAM test. Also, subscribe if you haven’t.
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • JUNE 2023 #395
Bob Japenga has been designing embedded systems since 1973. From 1988 - 2020, Bob led a small engineering firm specializing in creating a variety of real-time embedded systems. Bob has been awarded 11 patents in many areas of embedded systems and motion control. Now retired, he enjoys building electronic projects with his grandchildren. You can reach him at