Reflections on Software Development

Present-day equipment relies on increasingly complex software, creating ever-greater demand for software quality and security. The two attributes, while similar in their effects, are different. Quality software is not necessarily secure, and secure software is not necessarily of good quality. Safe software is both high in quality and secure. That means the software does what it is supposed to do, it prevents hackers and other external causes from modifying it, and, should it fail, it does so in a safe, predictable way. Software verification and validation (V&V) reduces issues attributable to defects, that is, to poor quality, but does not currently address misbehavior caused by external effects.

Poor software quality can result in huge material losses and even loss of life. Consider some notorious examples from the past. An F-22 Raptor flight control error caused the $150 million aircraft to be destroyed. An RAF Chinook engine controller fault caused the helicopter to crash, with 29 fatalities. A Therac radiotherapy machine gave patients massive radiation overdoses, causing the deaths of two people. The failure of a General Electric power grid monitoring system resulted in a 48-hour blackout across eight US states and one Canadian province. Toyota’s electronic throttle controller was said to be responsible for the deaths of 89 people.

Clearly, software quality is paramount, yet too often it takes a back seat to time to market and development cost. One essential attribute of quality software is traceability. This means that every requirement can be traced via documentation from the specification down to the particular line of code and, vice versa, every line of code can be traced up to the specification. The documentation process (not including testing and integration) is illustrated in Figure 1.

FIGURE 1: Simplified software design process documentation. Testing, verification and validation (V&V) and control documents are not shown.

The terminology is that of the DO-178 standard, which is mandatory for aerospace and military software. (Similarly, hardware development is guided by DO-254.) Other software standards may use different terminology, but the intentions are the same. DO-178 prescribes a document-driven process, for which many tools are available to the designer. Once the hardware-software partitioning has been established, software requirements define the software architecture and the derived requirements. Derived requirements are those that the customer doesn’t include in the specification and might not even be aware of. For instance, turning on an indicator light may take one sentence in the specification, but the decomposition of this simple task might lead to many derived requirements.

Safety-Instrumented Functions

While requirements are being developed, test cases must be defined for each and every one of them. Additionally, to increase system safety, so-called Safety-Instrumented Functions (SIFs) should be considered. SIFs are monitors that cause the system to shut down safely if its performance fails to meet previously defined safety limits. This is typically accomplished by redundancy in hardware, software, or both. If you neglect to address such issues at an early development stage, you might end up with an unsafe system and have to redo a lot of work later.
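
To make this concrete, here is a minimal sketch of an SIF-style limit monitor in C. It is only an illustration: the function names, the pressure limits, and the shutdown hook are hypothetical, and a real SIF would normally run on independent or redundant hardware rather than inside the main application.

#include <stdint.h>

/* Hypothetical safety limits; real values come from the system safety analysis. */
#define OIL_PRESSURE_MIN_KPA   150U
#define OIL_PRESSURE_MAX_KPA   900U
#define MAX_CONSECUTIVE_FAULTS   3U

/* Assumed hardware-access and shutdown hooks; not part of any real API. */
extern uint32_t read_oil_pressure_kpa(void);
extern void enter_safe_state(void);   /* de-energize outputs, log the event, alert the operator */

/* Called periodically (for example, every control cycle) as an independent monitor. */
void sif_oil_pressure_monitor(void)
{
    static uint32_t fault_count = 0U;
    uint32_t pressure = read_oil_pressure_kpa();

    if (pressure < OIL_PRESSURE_MIN_KPA || pressure > OIL_PRESSURE_MAX_KPA) {
        fault_count++;                       /* debounce transient readings */
        if (fault_count >= MAX_CONSECUTIVE_FAULTS) {
            enter_safe_state();              /* fail in a safe, predictable way */
        }
    } else {
        fault_count = 0U;
    }
}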

Quality design is also a bureaucratic chore. Version control and a configuration index must be maintained. The configuration index comprises the list of modules and their versions to be compiled for a specific version of the product under development. Without it, the configuration can be lost and a great deal of development effort with it.
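
As an illustration, one entry of a configuration index might look something like the following. The product, module names, version numbers, and compiler are, of course, hypothetical.

Product: engine controller ECU-100, software release 2.3
    main_control.c      v4.12
    sensor_iface.c      v2.07
    sif_monitor.c       v1.03
    Compiler: vendor C cross-compiler v9.1, options per build record BR-0042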

Configuration control and traceability are not just best engineering practices. They should be mandated whenever software is being developed. Some developers believe that software qualification to a specific standard is required only by the aerospace and military industries. Worse, some commercial software developers still subscribe to the so-called iron triangle: “Get to market fast, with all the planned features and a high level of quality. But pick only two.”

Engineers in safety-critical industries (such as medical, nuclear, automotive, and manufacturing) work with methods similar to DO-178 to ensure their software performs as expected. Large original equipment manufacturers (OEMs) now demand adherence to software standards: IEC 61508 for industrial controls, IEC 62304 for medical equipment, ISO 26262 for automotive, and so forth. The reason is simple. Unqualified software can lead to costly product returns and expensive lawsuits.

Software qualification is highly labor intensive and very demanding in terms of resources, time, and money. Luckily, its cost has been coming down thanks to a plethora of automated tools now being offered. Those tools are not inexpensive, but they do pay for themselves quickly. Considering the risk of lawsuits, recalls, brand damage, and other associated costs of software failure, no company can really afford not to go through a qualification process.

Testing

As with hardware, quality must be built into the software, and this means following strict process rules. You can’t expect to test quality into the product at the end. Some companies have tried, and the results include the infamous failures noted above.
Testing embedded controllers often presents a challenge because you need the final hardware before it is finished. Nevertheless, if you give testing due consideration as you prepare the software requirements, much can be accomplished by working in virtual or simulated environments. LDRA (www.ldra.com) offers one excellent tool suite for this task.
Numerous methods exist for software testing. For example, dynamic code analysis examines the program during its execution, while static analysis looks for vulnerabilities as well as programming errors without running the code. It has been shown mathematically that 100% test coverage is impossible to achieve. But even if it were, 35% to 40% of defects result from missing logic paths and another 40% from the execution of unique combinations of logic paths. Such defects wouldn’t get caught by testing, but they can be mitigated by SIFs.
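
As a simple illustration of a missing logic path, consider the C fragment below. The names and values are hypothetical. Tests written against the stated requirements may never exercise the unhandled case, yet a static analyzer, or a compiler warning on unhandled enumeration values, will typically flag it, and an SIF monitoring the output can mitigate it.

#include <stdint.h>

typedef enum { MODE_IDLE, MODE_RUN, MODE_TEST } mode_t;

/* Hypothetical fragment: MODE_TEST has no branch, so the function silently
 * returns 0 for it. This is a missing logic path that requirements-based
 * tests may never exercise. */
uint32_t select_fuel_rate(mode_t mode)
{
    uint32_t rate = 0U;    /* is 0 a safe value for every unhandled mode? */

    switch (mode) {
    case MODE_IDLE:
        rate = 100U;
        break;
    case MODE_RUN:
        rate = 800U;
        break;
    /* no case for MODE_TEST and no default branch */
    }
    return rate;
}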

Much embedded code is still developed in-house (see Figure 2). Is it possible for companies to improve programmers’ efficiency in this most labor-intensive task? Once again, the answer lies in automation. Nowadays, many tools come as complete suites providing various analyses, code coverage, coding standards compliance, requirements traceability, code visualization, and so forth. These tools are common among developers of avionics and military software, but they are not used as frequently by commercial developers because of their perceived high cost and steep learning curve.

FIGURE 2: Distribution of embedded software sources. Most is still developed in-house.

With the growth of cloud computing and the Internet of Things (IoT), software security is gaining unprecedented importance. Some security measures can be incorporated in hardware, while others are in software. Data encryption and password protection are vital elements. Unfortunately, some developers still do not treat security as seriously as they should. Security experts warn that numerous IoT developers have failed to learn the lessons of the past and that a “big IoT hack” in the near future is inevitable.

Security Improvements

On a regular basis, the media report on security breaches (e.g., governmental organization hacks, bank hacks, and automobile hacks). What can be done to improve security?

There are several techniques, such as those based on the Common Weakness Enumeration (CWE), that can help to improve our chances. However, securing software is likely a far more daunting task than achieving comprehensive V&V test coverage. One successful hack proves that security is weak, but how many unsuccessful hacks by test engineers are needed to establish that security is adequate? Eventually a manager, probably relying on some statistics, will have to decide that enough effort has been spent and the software can be released. Different types of systems require different levels of security, but how is this to be determined? And what about the human factor? Not every test engineer has the necessary talent for code breaking.

History teaches us that no matter how good a lock, a cipher, or a password, someone has eventually broken it. Several security developers in the past challenged the public to break their “unbreakable” code for a reward, only to see it broken within hours. How responsible is it to keep sensitive data and system access available in cyberspace just because it is convenient, inexpensive, or fashionable? Have the probability and the consequences of a potential breach always been duly considered?

I have used cloud-based tools, such as the excellent mbed, but would not dream of using them for a sensitive design. I don’t store data in the cloud, nor would I consider IoT for any system whose security is vital. I don’t believe cyberspace can provide sufficient security for many systems at this time. Ultimately, the responsibility for security is ours. We must judge whether the use of IoT or the cloud for a given product would be responsible. At present, I see little evidence that the industry is adequately serious about security. It will surely improve with time, but until it does I am not about to take unnecessary risks.


George Novacek is a professional engineer with a degree in Cybernetics and Closed-Loop Control. Now retired, he was most recently president of a multinational manufacturer of embedded control systems for aerospace applications. George wrote 26 feature articles for Circuit Cellar between 1999 and 2004. Contact him at gnovacek@nexicom.net with “Circuit Cellar” in the subject line.

How to Improve Software Development Predictability

The analytical methods of failure modes, effects, and criticality analysis (FMECA) and failure modes and effects analysis (FMEA) have been around since the 1940s. In recent years, much effort has been spent on bringing hardware-related analyses such as FMECA into the realm of software engineering. In “Software FMEA/FMECA,” George Novacek takes a close look at software FMECA (SWFMECA) and its potential for making software development more predictable.

The roots of failure modes, effects, and criticality analysis (FMECA) and failure modes and effects analysis (FMEA) date back to World War II. FMEA is a subset of FMECA in which the criticality assessment has been omitted, so for simplicity I’ll be using only the terms FMECA and SWFMECA in this article. FMECA was developed for the identification of potential hardware failures and their mitigation to ensure mission success. During the 1950s, FMECA became indispensable for analyses of equipment in critical applications in the military, aerospace, nuclear, medical, automotive, and other industries.

FMECA is a structured, bottom-up approach that considers a failure of each and every component, its impact on the system, and how to prevent or mitigate such a failure. FMECA is often combined with fault tree analysis (FTA) or event tree analysis (ETA). FTA differs from ETA only in that the former is focused on failures as the top event, the latter on some specific event. Those analyses start with an event and then drill down through the system to its root cause.

In recent years, much effort has been spent on bringing hardware-related analyses, such as reliability prediction, FTA, and FMECA, into the realm of software engineering. Software failure modes and effects analysis (SWFMEA) and software failure modes, effects, and criticality analysis (SWFMECA) are intended to be software analogues of the hardware analyses. In this article, I’ll cover SWFMECA as it specifically relates to embedded controllers.

Unlike the classic hardware FMECA, which is based on statistically determined failure rates of hardware components, the software analyses assume that the software design is never perfect because it contains faults introduced unintentionally by software developers. It is further assumed that in any complicated software there will always be latent faults, regardless of the development techniques, languages, and quality procedures used. This is likely true, but can it be quantified?

SOFTWARE ANALYSIS

SWFMECA should consider the likelihood of latent faults in a product and/or system, which may become patent during operational use and cause the product or the system to fail. The goal is to assess the severity of the potential faults, their likelihood of occurrence, and the likelihood of their escaping to the customer. SWFMECA should assess the probability of mistakes being made during the development process, including integration and verification and validation (V&V), and the severity of the failures resulting from those faults. SWFMECA is also intended to determine the faults’ criticality by combining fault likelihood with the consequent failure severity. This should help to determine the risk arising from software in a system. SWFMECA should examine the development process and the product behavior in two separate analyses.

First, the Development SWFMECA should address the development, testing, and V&V processes. This requires an understanding of the software development process, the V&V techniques, and the quality control applied during that process. It should establish what types of faults may occur when using a particular design technique or programming language, and what fault coverage the verification and validation techniques provide. Second, the Product SWFMECA should analyze the design and its implementation and establish the probability of the failure modes. It must also be based on a thorough understanding of the processes as well as the product and its use.

In my opinion, SWFMECA is a bit of a misnomer, with little resemblance to hardware FMECA. Speculating about what faults might be hidden in every line of code or introduced by every activity during software development is hardly realistic. However, there is a resemblance to functional-level FMECA. There, system-level effects of failures of functions can be established and addressed accordingly. Establishing the probability of those failures is another matter.

The data needed for such considerations are mostly subjective, their sources esoteric, and their reliability debatable. The data are developed statistically, based on history, experience, and long-term fault data collection. Some data may be available from polling numerous industries, but how applicable they are to a specific developer is difficult to determine. Plausible data may perhaps be developed by long-established software developers producing a specific type of software (e.g., Windows applications), but the development of embedded controllers, with their high mix of hardware/software architectures and relatively low-volume production, doesn’t seem to fit the mold.

Engineers understand that hardware has a limited life, and customers have no problem accepting mean time between failures (MTBF) as a reality. But software does not fail due to age or fatigue; it’s all in the workmanship. I have never seen an embedded software specification requiring software to have some acceptable probability of faults. Zero always seems to be implied.

SCORING & ANALYSIS

In the course of SWFMECA preparation, scores for potential faults should be determined for severity, likelihood of occurrence, and potential for escaping into the finished product. The three scores, each between 1 and 10, are multiplied to obtain the risk priority number (RPN). An RPN larger than 200 should warrant prevention and mitigation planning. Yet the scores are very much subjective; that is, they depend on the software complexity, the people, and other factors that are impossible to predict accurately. For embedded controllers, the determination of the RPN appears to be just analysis for the sake of analysis.
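
As a hypothetical illustration, a fault scored with a severity of 7, an occurrence likelihood of 5, and an escape likelihood of 6 yields RPN = 7 × 5 × 6 = 210. That exceeds the 200 threshold and would therefore call for prevention and mitigation planning; had the escape score been 5, the resulting RPN of 175 would not. The scores themselves are, of course, invented for this example.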

Statistical analyses are used every day, from science to business management. Their usefulness depends on the number of samples, and even with an abundance of samples there are no guarantees. SWFMECA can be instrumental for fine-tuning the software development process. In embedded controllers, however, software-related failures are addressed by FMECA. SWFMECA alone cannot justify the release of a product.

EMBEDDED SOFTWARE

In embedded controllers, the causes of software failures are often hardware related, and the exact outcomes are difficult to predict. Software faults need to be addressed by testing and code analyses and, most important, mitigated by the architecture. Redundancy, hardware monitors, and similar measures are time-proven methods.

Software begins as an idea expressed in requirements. Design of the system architecture, including hardware/software partitioning, comes next, followed by software requirements, usually presented as flow charts, state diagrams, pseudocode, and so forth. High- and low-level design follows, until the code is compiled. Integration and testing come next. This is shown in the ubiquitous chart in Figure 1.

Figure 1: Software development “V” model

During an embedded controller design, I would not consider performing the RPN calculation, just as I would not try to calculate software reliability. I consider those purely statistical calculations to be of little practical use. However, SWFMECA activity, with software ETA and FTA based on functions, should be performed as a part of the system FMECA. The software review can be automated to a large degree by tools such as the Software Call Tree and many others. Automation notwithstanding, one should always check the results for plausibility.

TOOLS

A Software Call Tree tells us how different modules interface and how a fault or an event would propagate through the system. Similarly, an Object Relational Diagram shows how objects’ internal states affect each other. And then there are the Control Flow Diagram, Entity Relationship Diagram, Data Flow Diagram, McCabe Logical Path, State Transition Diagram, and others. These tools are not inexpensive, but they generate data that make it possible to produce high-quality software. However, it is important to plan all the tests and analyses ahead of time. It is easy to get mired in so many evaluations that the project’s cost and schedule suffer with little benefit to software quality.

The assumed probability of a software fault then becomes a moot point. We should never plunge ahead and release code just because we’re satisfied that our statistical development model renders what we think is an acceptable probability of failure. Instead, we must assume that every function may fail for whatever reason and take steps to ensure those failures are mitigated by the system architecture.

System architecture and software analyses can be started only after determining that the requirements for the system are sufficiently robust. It is not unusual for a customer to insist on beginning development before signing the specification, which is often full of TBDs (i.e., “to be defined”). This may leave so many open issues that the design cannot and should not be started in earnest. Besides, development at such a stage is a violation of certification rules and will likely result in exceeding the budget and the schedule. Unfortunately, customers can’t or don’t always want to understand this, and their pressure often prevails.

The ongoing desire to bring software into the hardware paradigm is understandable. It could bring software development into a fully predictable, scientific realm. So far software development has resisted those attempts, remaining to a large degree an art. Whether it can ever become a fully deterministic process is, in my view, doubtful. After all, every creative process is an art. But great strides have been made in the development of tools, especially those for analyses, helping to make the process increasingly predictable.

This article appears in Circuit Cellar 297, April 2015.