
Debugging Embedded Real-Time Systems (Part 2)

Written by Bob Japenga

Bugs of Time

This month, Bob continues this series by identifying specific types of errors in our code which he calls “bugs of time.” Part of identifying these bugs is knowing what to look for. Identifying these bugs is the first step in ridding our systems of these enemies of ours.

  • What is a memory leak?

  • What is Fragmented Dynamic Memory (HEAP)?

  • What is a race condition?

  • What is a non-resident function?

  • What is a non-thread-safe function?

  • What is priority inversion?

  • What is a deadlock?

  • IBM Mainframe

  • C-Library

  • Memory Management Unit

  • GNU tool

  • GNU C Library

My first software program was written in FORTRAN IV. I entered the code on punch cards (Figure 1), and had to submit them to a computer operator, who would load them into a big IBM mainframe. My recurring bug was an infinite loop. Thankfully, the IBM operating system would throw my program off the machine, because it was running too long. Otherwise, I would have shut down the entire university computing center. These bugs seemed daunting, and I seemed to create them pretty frequently during the first few weeks.

Figure 1: FORTRAN Z(1) = Y + W(1) Punch Card

Little did I know that much more complex bugs were awaiting me in just a few years. One class of those complex bugs is what I call “bugs of time” (Figure 2).

Bugs of time are those that show up after some period of time. The period can be long or random. They do not happen every time a function is called—even with the same environment and parameters. Something in the system fails after it runs for some duration. Let’s dive right in and look at some bugs of time.

Figure 2: A bug of time
MEMORY LEAKS

When you allocate memory and don’t free it after you are through with it, you can generate a memory leak. The first large embedded C++ design I created had hundreds of memory leaks. It took me days to track all of them down. In one very large embedded system we developed, there was a 100-byte memory leak approximately every hour. We had about 16MB of free memory in the system. It would exhaust free memory in about 18 years of continuous operation. Now that was a bug of time (Figure 2)! We spent a lot of time trying to find it, and finally gave up. We knew that the system rebooted at least once a year, so we just needed to provide about 1MB of margin. But it was hard to leave that in the system and see it listed every week in the bug tracker software.
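As a minimal sketch of how such a leak hides in plain sight (all names here are hypothetical, not from the system described above), the usual culprit is an allocation whose ownership is unclear, so nobody ever frees it:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical log-formatting helper: it allocates a buffer and returns
   it, silently handing ownership to the caller. */
char *format_reading(int value) {
    char *buf = malloc(32);
    if (buf) snprintf(buf, 32, "reading=%d", value);
    return buf;
}

/* The caller never frees the buffer, so every call leaks ~32 bytes.
   The fix is a single free(msg) once the message has been written out. */
void log_reading(int value) {
    char *msg = format_reading(value);
    if (msg) {
        /* ... write msg to the log ... */
    }
    /* BUG: missing free(msg); */
}
```

Called once an hour, a helper like this produces exactly the kind of slow, statistically observable drift in free memory described below.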

What to look for: Obviously, a memory leak results in an out-of-memory condition. But it’s a bug before you run out of memory. Tools that we will discuss in a later article will help you find them. But first, you need to have a way to monitor free memory dynamically and perhaps through an exportable log. Then you look for slow trends indicating a reduction of available memory. I say “trends,” because in large systems (embedded Linux, for example), finding a small leak needs a statistically observable trend. That attribute made our 100-byte leak very difficult to detect.

FRAGMENTED DYNAMIC MEMORY (HEAP)

“Heap” is the memory that gets allocated with a new or malloc function. Heap fragmentation happens when the heap is broken into many small segments spread throughout the heap. Over time, you may have plenty of heap available, but not a contiguous block of memory large enough to meet your demands. Let me illustrate this with an extreme case, using a simple dynamic memory manager.

Imagine that you have set up 1MB of dynamic memory. You have calculated that you use about 550KB of memory at any one time. Imagine that you allocate 500KB initially. Figure 3 shows your initial memory map after this allocation.

Next you allocate a 20KB block, which leaves the memory map as shown in Figure 4. Then you deallocate the 500KB block and allocate another 20KB, which your dynamic memory manager takes from the first available block (Figure 5). Now, when you try to allocate another 500KB block, there is no memory available. The C-library tells you, however, that you have 960KB free. Your system fails and you don’t know why.
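The sequence above can be replayed with a toy first-fit allocator (a sketch for illustration only, not a real memory manager), treating the 1MB heap as 1,000 units of 1KB each:

```c
#include <string.h>

/* Toy first-fit allocator over a 1000-unit heap (1 unit = 1KB), used only
   to replay the fragmentation scenario in the text. */
#define HEAP_UNITS 1000
static char used[HEAP_UNITS];              /* 0 = free, 1 = allocated */

/* First-fit: find a run of n free units, mark it used, return its start
   index, or -1 if no contiguous run is large enough. */
static int toy_alloc(int n) {
    for (int start = 0; start + n <= HEAP_UNITS; start++) {
        int run = 0;
        while (run < n && !used[start + run]) run++;
        if (run == n) {
            memset(used + start, 1, n);
            return start;
        }
        start += run;                      /* skip past the blocking cell */
    }
    return -1;
}

static void toy_free(int start, int n) { memset(used + start, 0, n); }

/* Total free units -- the figure the C library would report. */
static int total_free(void) {
    int f = 0;
    for (int i = 0; i < HEAP_UNITS; i++) f += !used[i];
    return f;
}
```

Replaying the scenario: allocate 500, allocate 20, free the 500, allocate another 20 (first fit places it at the start of the hole), then request 500 again. The request fails even though 960 units are free, because the largest contiguous run is only 480 units.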

Figure 3: Initial Heap Memory Map
Figure 4: Heap Memory Map after 20KB malloc
Figure 5: Heap Memory Map when “Full”

You might ask, “Why doesn’t the memory management unit (MMU) defragment the heap?” But if it did, what would happen to the pointers into the old heap? A better hardware solution would be for the MMU to build a virtual mapping over non-contiguous physical memory, presenting fragmented memory as contiguous. Even if your MMU did that, memory access timing would vary as the heap became more and more fragmented. But that would be a different bug.

What to look for: Depending on the margin you have allowed in the amount of heap memory, this bug can take a long time to manifest itself. I am sure you can imagine how the memory can become checkerboarded after a long period of time. If you have written your code to verify and handle out-of-heap-memory conditions, you will know when it happens. You can also track the size of the largest available block, and look for trends. If you are not handling out-of-heap-memory conditions because, by design, you never allocate more than the amount that you have, shame on you! In that case, you will get all kinds of different manifestations and bizarre failures (such as segmentation faults and errant pointers). But sometimes you will get an out-of-heap error code from a system in the field after it has been deployed for months or even years, and you cannot understand why it is happening. It is probably heap fragmentation.

Early on in our experience with C++, we saw this a lot. The embedded system had lots of dynamic objects on the screen that constantly came and went. We learned that recycling screen objects, rather than deleting them, reduced the fragmentation—but it did not eliminate the problem. Resetting the system every so often at idle times was also a solution that we used.

RACE CONDITIONS

A “race condition” is created when the result of an algorithm depends on the sequencing of parallel threads. (I will be using the term “threads” to represent threads or tasks.) The race can be between threads, between a function call and an interrupt service routine (ISR), between two or more ISRs, or between a single software thread and hardware. I had one immature engineer tell me he couldn’t have any race conditions because he had no RTOS and no interrupts.

A simple race condition could involve a free-running hardware counter that software monitors for rollover by looking for the counter to reach 0. When the counter reaches its maximum value, there is a race for the software to catch the rollover before the counter increments again. Most of the time, the software has plenty of time to read it. But every once in a while—for example, when a wear-leveling algorithm kicks in—it loses the race. This is why race conditions are bugs of time. You never know when software is going to lose the race. But if you run the software long enough, it will fail.
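A closely related pattern can be sketched in C. Here a 16-bit counter and a software overflow count are composed into one 32-bit value; the ISR is invoked by hand in this simulation, but on real hardware it would fire asynchronously, and the unchecked read can lose the race between its two loads:

```c
#include <stdint.h>

/* Simulated "hardware": a 16-bit counter plus a software overflow count
   maintained by an ISR. In a real system these change asynchronously. */
static volatile uint16_t hw_count;
static volatile uint16_t overflows;

/* Timer ISR: increments the counter; on wrap, bumps the overflow count. */
void timer_isr(void) {
    if (++hw_count == 0) overflows++;
}

/* Racy composite read: if the ISR fires between the two loads, the pair
   is inconsistent (e.g., the new overflow count paired with a stale,
   pre-wrap counter value). */
uint32_t read_ticks_racy(void) {
    return ((uint32_t)overflows << 16) | hw_count;
}

/* Safe version: re-read until the overflow count is stable across the
   low-half read, proving the pair is coherent. */
uint32_t read_ticks_safe(void) {
    uint16_t hi, lo;
    do {
        hi = overflows;
        lo = hw_count;
    } while (hi != overflows);
    return ((uint32_t)hi << 16) | lo;
}
```

The retry loop is a common lock-free alternative to disabling interrupts around the two reads; either approach removes the race.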

Of course, most race conditions are more complex, and they are a challenge to find. They may only get uncovered in the presence of unexpected inputs, which is what happened with the race condition that contributed to the great Northeast power blackout of 2003 [1]. Rigorous code review is often the only hope—looking for every place where the code is sequence dependent.

What to look for: How do you know you have a race condition? Certainly, failures that occur at random times are one indicator. Missed deadlines or missed data points are other indicators, but those can point to a host of problems. It all depends on what the race condition causes to happen, and that is totally dependent on your code. Slowing the system down or speeding it up, and observing whether the average rate of occurrences changes, is one approach. Often, instrumenting a race condition causes it to go away. This places it in a class of bugs called “heisenbugs”: bugs that seem to disappear or alter their behavior when you try to observe them.


NON-REENTRANT AND NON-THREAD-SAFE FUNCTIONS

I call these particular bugs of time “concurrency bugs,” because they only manifest themselves in multi-threaded environments. They are very different from one another, but can manifest themselves in a similar way. Let’s look at each.

Non-reentrant functions: A function is “reentrant” if it can be interrupted and safely called again before the previous call completes. Any function that holds static data or global variables cannot be reentrant, nor can a reentrant function return a pointer to static data. Finally, and obviously, a reentrant function cannot call a non-reentrant function. This attribute (like thread safety, discussed below) is much misunderstood, and can cause all sorts of bugs of time if not fully understood.

The probability of a non-reentrant function being interrupted and then re-entered can be vanishingly small. Thus, your system may pass very rigorous testing over hundreds of hours of operation. But put 100,000 of these units into the field, and this can bite badly and soon.
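A minimal sketch of the problem (hypothetical helper names): the first version below is non-reentrant because every caller shares one static buffer; the second is reentrant because the caller supplies the storage.

```c
#include <stdio.h>

/* Non-reentrant: the static buffer is shared by every caller. If an ISR
   or another thread calls this mid-execution, the first caller's result
   is silently overwritten. */
char *utoa_bad(unsigned v) {
    static char buf[12];
    snprintf(buf, sizeof buf, "%u", v);
    return buf;
}

/* Reentrant: the caller owns the buffer, so concurrent or nested calls
   cannot interfere with one another. */
void utoa_good(unsigned v, char *buf, size_t len) {
    snprintf(buf, len, "%u", v);
}
```

This is exactly the pattern behind classic C library pairs such as `strtok`/`strtok_r` and `asctime`/`asctime_r`: the `_r` (reentrant) variant takes caller-supplied storage.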

Non-thread-safe functions: The same can be said about non-thread-safe functions. A function is thread-safe if it protects access to shared resources from concurrent access.

As designers of multi-threaded systems, it is imperative that you know which functions in the libraries you use are thread-safe and reentrant. Then you must ensure that you use only those kinds of functions in your threads (hopefully with a check at build time), or guard the others with protection such as mutexes or semaphores.
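Guarding shared state is straightforward; here is a sketch using POSIX mutexes (any RTOS mutex or semaphore API follows the same lock/touch/unlock shape):

```c
#include <pthread.h>

/* Shared state plus the mutex that serializes access to it. */
static unsigned event_count;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread-safe increment: the read-modify-write on event_count can no
   longer interleave with another thread's. */
void count_event(void) {
    pthread_mutex_lock(&count_lock);
    event_count++;
    pthread_mutex_unlock(&count_lock);
}

/* Thread-safe read, taken under the same lock for a coherent snapshot. */
unsigned get_event_count(void) {
    pthread_mutex_lock(&count_lock);
    unsigned n = event_count;
    pthread_mutex_unlock(&count_lock);
    return n;
}
```

The key discipline is that every access to the shared variable goes through the lock; one unguarded access anywhere reintroduces the bug.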

Frankly, I have found that identifying these is not easy, and lists are often wrong. For example, FreeRTOS has excellent documentation, but to easily find in one place what functions in FreeRTOS and your supporting libraries are reentrant and thread-safe requires more work than it should.

Another consideration that all designers of multi-threaded systems must be aware of is whether all threads and processes share one instantiation of a library, or every process gets its own. Again, this information is easy to determine for each RTOS, but it is not always brought to our attention. If, for example, you are using Linux and the GNU toolchain with static binding, a new instantiation of the GNU C Library (glibc) is used for every process. Each process can then use non-reentrant and non-thread-safe functions without affecting other processes, but not across threads within that process. If, in contrast, you use dynamic binding, glibc is shared across processes, and you must only use the reentrant and thread-safe functions within glibc.

Let me stress that these are understandings that you must have when you begin the coding of your design, and need to be enforced throughout the life cycle. At the beginning of a project with concurrency, you must make a list of functions that cannot be used in concurrent functionality. Missing one can cause a bug of time at the worst possible moment.

What to look for: If you have non-reentrant functions or non-thread-safe functions in concurrent threads of execution, the bugs can manifest themselves in a variety of ways. You can get wrong results from deterministic algorithms. You can see dropped data or extra data. And you will not see it every time, even with the “exact conditions.” If we suspect a concurrency bug, we sometimes set all threads and processes not to allow pre-emption. Does the problem go away? You may need to turn off interrupts one at a time, and see if the problem goes away.

But far and away, the best first approach is to create a list of all non-reentrant and non-thread-safe functions, and analyze where they are used. You may not find your bug, but you might find others that otherwise may not be uncovered until you have a million units in the field for a year.

PRIORITY INVERSION

“Priority inversion” occurs when a lower-priority thread prevents a higher-priority thread from running. Usually, this happens because both threads are sharing a resource. Priority inversion can happen in well documented and “well” tested systems—as happened in the Mars Pathfinder mission in 1997 [2].

Figure 6 provides a simple example of bounded priority inversion. Both a high-priority thread and a low-priority thread share a resource protected by Mutex X. The low-priority thread grabs Mutex X when the high-priority thread is not running, thereby blocking the high-priority thread from running when it should. By design, critical sections surrounded by locks and unlocks are usually short. That is why this type of priority inversion is called “bounded.”

Figure 6: Simple (Bounded) Priority Inversion

But, in our example above, if a medium-priority thread pre-empts the low-priority thread while it holds Mutex X, the medium-priority thread can hold off the high-priority thread for a much longer period of time. Priority inversions of this type are therefore called “unbounded” (Figure 7).

Figure 7: Unbounded Priority Inversion

What to look for: In the bounded case, you would look for missed deadlines, missed outputs, missing data, watch-dog trips, or out-of-spec jitter. Once we had a digital output that was only out of tolerance about once a day. We captured it overnight with an oscilloscope in envelope mode.

DEADLOCKS

The classic story of the monkey and the coconut trap [3] illustrates the deadlock. A hollow coconut containing rice is chained to a stake. The monkey has two conflicting threads running in his mind. One wants to get some rice by inserting his hand through the hole in the coconut, but the other thread simultaneously wants to pull his rice-filled hand out. His open hand fits through the hole, but his clenched fist does not (Figure 8).


Figure 8: The Coconut Trap Deadlock

Deadlocks in software happen when two or more threads want the same resource, and the thread holding it won’t release it until something happens; that something, in turn, depends on the very thread that is waiting for the resource. Of course, it can get much more convoluted: thread A waits for thread B, which waits for thread C, which waits for thread A, and so on.
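Two standard defenses can be sketched in a few lines of C: impose a global lock ordering so every thread acquires the locks in the same sequence, or use a non-blocking trylock and back off so a thread never holds one lock while waiting indefinitely for another.

```c
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Defense 1 -- lock ordering: every thread that needs both resources
   must take lock_a before lock_b, so a circular wait cannot form. */
void transfer(void) {
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... use both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}

/* Defense 2 -- trylock with back-off: if the second lock is busy,
   release the first and let the caller retry, instead of blocking
   while holding a resource someone else may need. */
int try_transfer(void) {
    pthread_mutex_lock(&lock_a);
    if (pthread_mutex_trylock(&lock_b) != 0) {
        pthread_mutex_unlock(&lock_a);
        return -1;                /* caller retries later */
    }
    /* ... use both resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return 0;
}
```

The monkey's problem, in these terms, is that he refuses to release the rice (the back-off step) to escape the trap.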

What to look for: Deadlocks are often obvious with tasks that are frozen in a wait state, watchdogs that get tripped, locks that never get released, events that get missed, or data that gets dropped. Unless it is a temporary condition, it is usually catastrophic.

CONCLUSION

I hate bugs of time. They are hard to replicate and even harder to find. Next time, we will look at those bugs that do not fall under my classifications of “bugs of scale” or “bugs of time.” But of course, only in thin slices. 

Additional materials from the author are available at:
www.circuitcellar.com/article-materials
References [1] to [3] as marked in the article can be found there.

References
[1] The great Northeastern US power blackout of 2003:
https://www.ieso.ca/corporate-ieso/media/also-of-interest/blackout-2003
[2] What really happened on the 1997 Mars Pathfinder mission:
http://www.cs.cornell.edu/courses/cs614/1999sp/papers/pathfinder.html
[3] The monkey and the coconut trap:
https://www.theguardian.com/lifeandstyle/2014/nov/14/how-to-avoid-monkey-trap-oliver-burkeman

PUBLISHED IN CIRCUIT CELLAR MAGAZINE • JUNE 2022 #383


Bob Japenga has been designing embedded systems since 1973. From 1988 - 2020, Bob led a small engineering firm specializing in creating a variety of real-time embedded systems. Bob has been awarded 11 patents in many areas of embedded systems and motion control. Now retired, he enjoys building electronic projects with his grandchildren. You can reach him at
Bob@ListeningToGod.org

Copyright © KCK Media Corp. All Rights Reserved