
Debugging Embedded Real-Time Systems 

Written by Bob Japenga

Strategies to Determine if a Problem is in HW or SW

This month, I continue my series on debugging embedded real-time systems. In this article, I will look at the challenges in determining whether a problem is in hardware or software—and perhaps a few strategies to deal with those challenges.

  • How do I debug an embedded real-time system?
  • How can I tell if a bug is in the hardware or the software of an embedded real-time system?
  • What are some strategies for dealing with hardware and software bugs in embedded real-time systems?
  • Embedded real-time systems
  • Microchip PIC32MZ1024EFG144

Debugging embedded systems is hard work. And it doesn’t get any easier when you don’t know if the bug is in the software or the hardware. One website that aims to help you debug your computer boldly states: “A slow computer is always a result of software, not hardware problems…” Not so in PCs, nor in embedded systems. Yes, the software in an embedded system might have bugs that are causing it to miss deadlines or become sluggish, but those issues could also be caused by faulty hardware. How can we tell? This month I will outline six strategies for helping us deal with bugs that could be in either software or hardware.

Strategy #1—When Nothing Makes Sense

Figure 1
The Oxlife Independence portable oxygen concentrator

A few years back we were porting a large portable oxygen concentrator (Figure 1) to create a wearable oxygen concentrator (Figure 2). One of the ways we lowered the cost and size was to reduce the number of microcontrollers (MCUs) in the system from four to one. So, the one MCU needed to be more powerful than any of the previous processors. We chose an MCU from the Microchip PIC32MZ family (PIC32MZ1024EFG144). The port was fairly straightforward. We were moving code we had written for the other processors into the new PIC32. We knew the design like the back of our hands. Functionality was mostly being reduced, so we were removing code. The software team reported we were nearing completion. Then one day, the device (still in development) was chugging along and filtering (mechanically) the oxygen from the outside air through a sieve bed, when it crashed and rebooted.

Figure 2
The Oxlife Freedom wearable oxygen concentrator

I won’t bore you with all the details, but we did everything we could to find out what was happening. I will talk about some of these techniques in a later article. After somewhere between one and 30 minutes, the software would crash and reboot. This PIC has registers that told us the power supply was not dipping (by the way, that is a great feature of many MCUs). The trapping mechanism we employed showed that different code was running each time the crash occurred. We knew that an illegal instruction was being encountered, but always in different places. Everything we knew about determining whether a problem is in hardware or in software told us that this problem was in hardware, but we couldn’t be sure and couldn’t prove it. The information we had was:

  • It crashed at different locations in the code.
  • Changing the code’s timing and memory layout caused the problem to occur less frequently—but not go away.
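The two kinds of evidence above (the reset-cause register and the scattered crash locations) can be tabulated with a few lines of host-side code. This is a minimal sketch, not our actual tooling: the bit positions in RESET_FLAGS are placeholders rather than the PIC32MZ register layout (take the real ones from the device data sheet), and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical reset-cause bit layout. Many MCUs expose a similar register,
# but the names and positions here are illustrative only.
RESET_FLAGS = {
    0: "power-on reset",
    1: "brown-out reset",
    4: "watchdog timeout",
    6: "software reset",
    7: "external reset pin",
}

def decode_reset_cause(reg_value):
    """Return the names of every reset flag set in reg_value."""
    return [name for bit, name in sorted(RESET_FLAGS.items())
            if reg_value & (1 << bit)]

def supply_dip_suspected(reg_value):
    """True if the brown-out flag (hypothetical bit 1) is set."""
    return bool(reg_value & (1 << 1))

def crash_sites_cluster(addresses, threshold=0.5):
    """True if a single address accounts for more than `threshold` of crashes."""
    if not addresses:
        return False
    _, count = Counter(addresses).most_common(1)[0]
    return count / len(addresses) > threshold
```

If supply_dip_suspected() stays false and crash_sites_cluster() stays false over many captured crashes, a single bad code path becomes a less likely explanation, which is the pattern we were seeing.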

We worked closely with Microchip’s Field Applications Engineer, and he continued to assure us that the chip did not have any problems like we were reporting. Our code was so tied to our I/O complement that it was not easy to send them our code without sending them our entire hardware setup. And of course, when we did strip out all the I/O, it never failed. 

We were forced to use one of the most powerful debugging techniques that I know in dealing with intermittent errors: divide and conquer. We began stripping out code and seeing if it failed. No matter what code we took out, the problem would go away. If we took out the DMA processing, the problem went away. If we put the DMA back in and took out the digital I/O processing, the problem went away. If we put digital I/O processing back in and took out the A/D processing, the problem went away. If we only stopped the I2C bus, the problem went away. On and on this went, without us being able to isolate the problem.

Now some background: In this MCU, the I/O and the peripherals are almost completely configurable. At one point we calculated that you could configure over a billion different combinations of I/O and peripherals. We ultimately found that the particular combination of I/O and pins that we were using caused one of the internal DC rails to go down when we were reading the internal reference voltage with the A/D. Removing any I/O reduced the load, and the system wouldn’t crash. When the DC rail went down, it caused the processor to fetch a bad instruction and the software crashed. Note: This “feature” is not in the chip’s errata (yet). 
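To see why divide and conquer could not isolate a single culprit here, consider a toy model of a shared rail. All numbers below are made up for illustration, and the names (LOADS, RAIL_LIMIT) are hypothetical. The only point is that when the failure depends on a total exceeding a limit, removing any one contributor makes it go away.

```python
# Made-up load figures in arbitrary units, for illustration only.
LOADS = {"dma": 3, "digital_io": 4, "adc": 3, "i2c": 2}
RAIL_LIMIT = 11  # hypothetical: the rail sags when total load exceeds this

def crashes(enabled):
    """Toy model: the system crashes only when total load exceeds the limit."""
    return sum(LOADS[name] for name in enabled) > RAIL_LIMIT

def removal_report(all_blocks):
    """For each block, report whether the crash persists with that block removed."""
    return {name: crashes(all_blocks - {name}) for name in sorted(all_blocks)}

def shared_cause_suspected(all_blocks):
    """Full build crashes, yet no single removal leaves the crash in place."""
    return crashes(all_blocks) and not any(removal_report(all_blocks).values())
```

With every block enabled the total is 12 and the model crashes; take out any one block and it stops. When a removal report looks like that, a shared resource such as a supply rail is worth a look.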

Thus, this experience gives us our first strategy to determine if a bug is in software or hardware: When the dent in the wall next to your desk has more than a 3/4-inch depression, the problem is in hardware. A better way to phrase this might be: The problem is in hardware when nothing else makes sense.

Strategy #2—Develop Hardware Test Software

One of the challenges in developing embedded systems is that often we start debugging new software on hardware that is just as new. It’s the nature of the design process. Here is one of the most helpful strategies I used in my career to determine if a problem was in hardware: I (or someone else) would write software to test the hardware. 

Keep this software as simple as possible: as few interrupts as possible, no RTOS, and as small as possible. Yet, with that in mind, attempt to use the hardware in the same way that you are going to use it in the application. These conflicting requirements can be a real challenge, but they are both important. So, keep it simple, but not too simple. For example, if the application software is going to use the hardware’s I2C controller, your tests must use it too.

Keep the test software in standalone modules that can do repetitive testing. I don’t care to mention how many times some hardware worked when I exercised it once, but failed when I exercised it on the 10,000th time.

Don’t have the software just repeat the same function over and over, but do allow portions of the code leading up to the final stage of the output to be repeated. We used to call these “scope loops,” although no one seems to call them that anymore. They allow you to look at rise times or latencies that you could not pick up if you just did the whole sequence end-to-end. 
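The two ideas above, repetitive testing and scope loops, can be sketched as follows. This is a host-side illustration under stated assumptions: FakeDevice is a stand-in for real hardware, and run_repeated and scope_loop are hypothetical names. On a target, the callables would wrap your real register accesses.

```python
def run_repeated(test_step, iterations):
    """Run test_step(i) `iterations` times; return the failing iteration numbers."""
    failures = []
    for i in range(1, iterations + 1):
        if not test_step(i):
            failures.append(i)
    return failures

def scope_loop(stage, repeats):
    """Repeat one stage of a sequence so a scope can trigger on it."""
    for _ in range(repeats):
        stage()

class FakeDevice:
    """Stand-in for hardware that misbehaves on one particular access."""
    def __init__(self, fail_on):
        self.fail_on = fail_on
        self.accesses = 0
        self.toggles = 0

    def write_then_readback(self, _iteration):
        """Pretend to write a value and read it back; fail on access `fail_on`."""
        self.accesses += 1
        return self.accesses != self.fail_on

    def toggle_pin(self):
        """Pretend to toggle an output pin (the stage a scope would watch)."""
        self.toggles += 1
```

Running run_repeated(device.write_then_readback, 1) on a FakeDevice(fail_on=10000) passes, while 20,000 iterations report the failure at iteration 10,000. That is the case a single manual check never catches.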

This code is useful not just when you are first integrating with the hardware, but also when changes are made to the hardware. So, keep it up to date, and after every hardware change check that the board still passes the hardware tests. It is also useful when you know that the hardware design works, but the particular board that you are using may have a hardware failure rather than a design failure. Being able to whip out the hardware tests independent of your application code will be incredibly useful.

Strategy #2 is to have diagnostic software that verifies the hardware is working.

Strategy #3—Analysis-Based Instrumentation

Once, we had a system that started dropping keyboard input after it was in the field for a few months. The user would push a key and nothing would happen. Our customer desperately wanted the problem to be in the software since he had 10,000 of them in the field with just one of his customers. He could easily update the software, but bringing all of the units back to fix the hardware would cost him a pretty penny. How could we demonstrate to him that it was not in the software, but in the hardware? 

Our first strategy was to analyze all the possible ways (that we could think of) that the software could manifest the problem. The customer’s manager, who desperately wanted it to be in software, provided some completely out-of-the-box possibilities. Then we needed to determine the earliest point in the software chain where it could happen. In our case, since the keypad hardware used a keyscan matrix circuit, reading each key involved more than reading one discrete input. In fact, it required both discrete outputs and discrete inputs. A keyscan matrix circuit is used in almost any keyboard of at least moderate size because it cuts down on the number of wires needed. Imagine your PC having a discrete wire coming in from every key on your keyboard. It would take over 100 wires and 100 inputs. With a keyscan matrix, 12 wires can be used to read 36 discrete keys. (Figure 3 demonstrates how eight wires can be used to read 16 switches.) In our case, we were able to instrument a failing unit and demonstrate that the outputs were going out, but the inputs were not coming back. If the instrumentation had instead shown that this was working, we would have gone to the next possible path and instrumented that.

Figure 3
Keyscan switch matrix
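The arithmetic behind a keyscan matrix is simple: R row lines plus C column lines read R x C keys, so eight wires arranged 4 x 4 read 16 keys and 12 wires arranged 6 x 6 read 36. The sketch below models a scan and the kind of instrumentation described above. It is a toy simulation, not our product code: KeyMatrix, open_columns, and scan are hypothetical names, and the "open column" stands in for any broken return path.

```python
def max_keys(wires):
    """Largest key count a rows x cols matrix can read with `wires` lines."""
    rows = wires // 2
    cols = wires - rows
    return rows * cols

class KeyMatrix:
    """Toy model of a keyscan matrix: drive one row, read all columns."""
    def __init__(self, rows, cols, open_columns=()):
        self.rows, self.cols = rows, cols
        self.pressed = set()                    # (row, col) pairs held down
        self.open_columns = set(open_columns)   # columns with a broken return path

    def read_columns(self, driven_row):
        """Return the column inputs seen while `driven_row` is driven."""
        return [
            (driven_row, c) in self.pressed and c not in self.open_columns
            for c in range(self.cols)
        ]

def scan(matrix):
    """Scan every row; return (keys detected, number of rows driven)."""
    detected, rows_driven = set(), 0
    for r in range(matrix.rows):
        rows_driven += 1  # instrumentation point: this output did go out
        for c, active in enumerate(matrix.read_columns(r)):
            if active:
                detected.add((r, c))
    return detected, rows_driven
```

With a key held down on a column listed in open_columns, scan() still reports every row as driven but detects nothing. That is the signature we saw on the failing unit: outputs going out, inputs not coming back.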

Strategy #3 is to identify all the ways the software can manifest the problem, and then instrument up the chain, starting at the lowest level, until you can identify the problem. By the way, the problem turned out to be a manufacturing decision to use tin-plated connectors instead of gold. They saved a few pennies, only to spend hundreds of thousands of dollars to retrofit.

Strategy #4—Understand What in the Hardware Could Cause the Problem

Just as in strategy #3, where we categorized all the ways the software could cause the bug, here we do the same with hardware. Understand the circuits involved. Make sure you read and double-check the data sheets and errata sheets carefully. Previously, we read them to determine how to interface with the chip. Now we are specifically looking at the documentation from the perspective of this bug. 

We once had a cell modem that would occasionally lock up after power-up. Upon careful review, we found two conflicting timing diagrams. One of them included a note about not enabling a chip until X milliseconds after it was powered up. The other chart clearly showed power being applied to the chip while it was enabled and did not flag any problems. We were enabling the chip for a few microseconds before power was applied, and this sometimes caused the lock-up. Under test conditions, we determined the lock-up occurred about once every 5,000 to 10,000 times, which is why we didn’t find it during our original tests. Only after we had more than 1,000 in the field was it getting reported as a problem. If you haven’t noticed, hardware designers and their documentation are human too.
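A quick calculation shows why a one-in-5,000 failure slips through ordinary bench testing. The sketch below assumes each power-up is an independent trial with a fixed failure rate, which is a simplification. The function names are hypothetical.

```python
import math

def chance_of_seeing_failure(failure_rate, cycles):
    """Probability that at least one failure shows up in `cycles` independent tries."""
    return 1.0 - (1.0 - failure_rate) ** cycles

def cycles_needed(failure_rate, confidence=0.95):
    """Power cycles needed to see at least one failure with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))
```

Under those assumptions, 100 power cycles have only about a 2% chance of showing a one-in-5,000 lock-up, and it takes on the order of 15,000 cycles to be 95% sure of seeing it once. An automated power-cycling rig is the practical way to get there.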

Strategy #4 is to carefully look at and analyze what in the hardware could cause this specific bug, then dig into the documentation and errata sheets for that problem. Finally, instrument the hardware based on this analysis.

Strategy #5—Strategic Logging

Paul Bunyan had it right. Logging is important. In my previous article in this series, I mentioned the importance of building logging into your design. But when debugging a system that might have a software or hardware bug, logging needs to be added that specifically identifies what is happening in and around the bug. Logging the hardware states and registers in and around the area of the bug can be helpful.

One time, after we had about 100,000 of a particular version of a solar energy monitor in the field, we started getting random crashes and reboots. By first identifying as many things as we could think of and placing logs that would help us narrow the choices down, we eventually traced it to a watchdog chip that was failing. In the failure mode, it timed out earlier than its programmed timeout and thus occasionally reset the system. We found that our fab house was using a cleaning agent on the PCB that the small print on the chip’s datasheet specifically prohibited.

You mean the fab house has to read the small print on every chip that is used in a design? Or is it our responsibility as designers? Real engineers carefully read every detail on datasheets and errata. But I digress from our main topic. Strategy #5 is to create targeted logs in and around the bug.
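Targeted logging for a suspect watchdog might look like the sketch below. It is a host-side illustration, not the code we shipped: the tag names ("wdt_kick", "reset") and the EventLog class are hypothetical, and on a real target the records would have to live somewhere that survives a reset, such as non-volatile memory.

```python
from collections import deque

class EventLog:
    """Fixed-size ring buffer of (timestamp_ms, tag, value) records."""
    def __init__(self, capacity=64):
        self.records = deque(maxlen=capacity)  # oldest records drop off first

    def add(self, timestamp_ms, tag, value=None):
        self.records.append((timestamp_ms, tag, value))

def early_watchdog_resets(log, timeout_ms):
    """Return reset timestamps that came sooner after a kick than timeout_ms."""
    early, last_kick = [], None
    for timestamp_ms, tag, _ in log.records:
        if tag == "wdt_kick":
            last_kick = timestamp_ms
        elif tag == "reset" and last_kick is not None:
            if timestamp_ms - last_kick < timeout_ms:
                early.append(timestamp_ms)
    return early
```

A reset that arrives well inside the programmed timeout after the last recorded kick points away from the software failing to service the watchdog and toward the watchdog hardware itself.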

Strategy #6—Run on Other Hardware

Rarely do we have the luxury of implementing this strategy. But it’s a good option to keep in mind. One time we had a series of postage scales we designed that all had the same code base. They differed in scale capacity, graphics interface, and keyboard type (Figure 4). Randomly, while entering data from the keyboard, all the smaller units would reset. Was the problem in the hardware or in the software’s keyboard handling? We were able to run the same keyboard software on the larger unit (even though the key mapping was funny) and determine that it wasn’t in the keyboard software. Eventually, we determined that it was a static issue with all of the keyboards used in the smaller units that didn’t exist in the higher-capacity units.

Figure 4
Ascom Postage Scale

Conclusion

Understanding whether a problem is in the hardware or the software in an embedded system is an important step in debugging and fixing the problem. I have provided six strategies that I have used. What about you? Email me some of your strategies to determine if a bug is in hardware or in software. I would love to share them with the Circuit Cellar community.

Next time I will discuss some of the specific debug techniques that I have used in debugging embedded systems. But, as always, only in thin slices. 

Article Correction: Issue #391, page 46 contained an error. The sentence, “With a keyscan matrix, 12 wires can be used to read 144 discrete keys,” should instead have read: “With a keyscan matrix, 12 wires can be used to read 36 discrete keys.”

RESOURCES
O2 Concepts: https://o2-concepts.com/products
Microchip: https://www.microchip.com/

SOURCES
Explanation of how an oxygen concentrator filters out the nitrogen: https://www.oxygentimes.com/guides/how-oxygen-concentrators-work
Explanation of how a keyboard scan matrix works: https://pcbheaven.com/wikipages/How_Key_Matrices_Works/

PUBLISHED IN CIRCUIT CELLAR MAGAZINE • FEBRUARY 2023 #391


Bob Japenga has been designing embedded systems since 1973. From 1988 to 2020, Bob led a small engineering firm specializing in creating a variety of real-time embedded systems. Bob has been awarded 11 patents in many areas of embedded systems and motion control. Now retired, he enjoys building electronic projects with his grandchildren. You can reach him at Bob@ListeningToGod.org.


Copyright © KCK Media Corp.
All Rights Reserved
