The roots of failure modes effects and criticality analysis (FMECA) and failure modes effects analysis (FMEA) date back to World War II. FMEA is a subset of FMECA in which the criticality assessment has been omitted. Therefore, for simplicity, I’ll be using the terms FMECA and SWFMECA only in this article. FMECA was developed for identification of potential hardware failures and their mitigation to ensure mission success. During the 1950s, FMECA became indispensable for analyses of equipment in critical applications, such as those occurring in military, aerospace, nuclear, medical, automotive, and other industries.
FMECA is a structured, bottom-up approach considering a failure of each and every component, its impact on the system and how to prevent or mitigate such a failure. FMECA is often combined with fault tree analysis (FTA) or event tree analyses (ETA). The FTA differs from the ETA only in that the former is focused on failures as the top event, the latter on some specific events. Those analyses start with an event and then drill down through the system to their root cause.
In recent years, much effort has been spent on bringing hardware related analyses, such as reliability prediction, FTA, and FMECA into the realm of software engineering. Software failure modes and effects analysis (SWFMEA) and software failure modes, effects, and criticality analysis (SWFMECA) are intended to be software analyses analogous to the hardware ones. In this article I’ll cover SWFMECA as it specifically relates to embedded controllers.
Unlike the classic hardware FMECA based on statistically determined failure rates of hardware components, software analyses assume that the software design is never perfect because it contains faults introduced unintentionally by software developers. It is further assumed that in any complicated software there will always be latent faults, regardless of development techniques, languages, and quality procedures used. This is likely true, but can it be quantified?
SWFMECA should consider the likelihood of latent faults in a product and/or system, which may become patent during operational use and cause the product or the system to fail. The goal is to assess severity of the potential faults, their likelihood of occurrence, and the likelihood of their escaping to the customer. SWFMECA should assess the probability of mistakes being made during the development process, including integration, verification and validation (V&V), and the severity of these faults on the resulting failures. SWFMECA is also intended to determine the faults’ criticality by combining fault likelihood with the consequent failure severity. This should help to determine the risk arising from software in a system. SWFMECA should examine the development process and the product behavior in two separate analyses.
First, Development SWFMECA should address the development, testing and V&V process. This requires understanding of the software development process, the V&V techniques and quality control during that process. It should establish what types of faults may occur when using a particular design technique, programming language and the fault coverage of the verification and validation techniques. Second, Product SWFMECA should analyze the design and its implementation and establish the probability of the failure modes. It must also be based on thorough understanding of the processes as well as the product and its use.
In my opinion, SWFMECA is a bit of a misnomer with little resemblance to the hardware FMECA. Speculations what faults might be hidden in every line of code or every activity during software development is hardly realistic. However, there is resemblance with the functional level FMECA. There, system level effects of failures of functions can be established and addressed accordingly. Establishing the probability of those failures is another matter.
The data needed for such considerations are mostly subjective, their sources esoteric and their reliability debatable. The data are developed statistically, based on history, experience and long term fault data collection. Some data may be available from polling numerous industries, but how applicable they are to a specific developer is difficult to determine. Plausible data may perhaps be developed by long established software developers producing a specific type of software (e.g., Windows applications), but development of embedded controllers with their high mix of hardware/software architectures and relatively low-volume production doesn’t seem to fit the mold.
Engineers understand that hardware has limited life and customers have no problem accepting mean time between failures (MTBF) as a reality. But software does not fail due to age or fatigue. It’s all in the workmanship. I have never seen an embedded software specification requiring software to have some minimum probability of faults. Zero seems always implied.
SCORING & ANALYSIS
In the course of SWFMECA preparation, scores for potential faults should be determined: severity, likelihood of occurrence, and potential for escaping to the finished product. The scores between 1 to 10 are multiplied and thus the risk priority number (RPN) is obtained. An RPN larger than 200 should warrant prevention and mitigation planning. Yet the scores are very much subjective—that is, they’re dependent on the software complexity, the people, and other impossible to accurately predict factors. For embedded controllers the determination of the RPN appears to be just an analysis for the sake of analysis.
Statistical analyses are used every day from science to business management. Their usefulness depends on the number of samples and even with an abundance of samples there are no guarantees. SWFMECA can be instrumental for fine-tuning the software development process. In embedded controllers, however, software related failures are addressed by FMECA. SWFMECA alone cannot justify the release of a product.
In embedded controllers, causes of software failures are often hardware related and exact outcomes are difficult to predict. Software faults need to be addressed by testing, code analyses, and, most important, mitigated by the architecture. Redundancy, hardware monitors, and others are time proven methods.
Software begins as an idea expressed in requirements. Design of the system architecture, including hardware/software partitioning is next, followed by software requirements, usually presented as flow charts, state diagrams, pseudo code, and so forth. High and low levels of design follow, until a code is compiled. Integration and testing come next. This is shown in the ubiquitous chart in Figure 1.
Figure 1: Software development “V” model
During an embedded controller design, I would not consider performing the RPN calculation, just as I would not try to calculate software reliability. I consider those purely statistical calculations to be of little practical use. However, SWFMECA activity with software ETA and FTA based on functions should be performed as a part of the system FMECA. The software review can be to a large degree automated by tools, such as Software Call Tree and many others. Automation notwithstanding, one should always check the results for plausibility.
Software Call Tree tells us how different modules interface and how a fault or an event would propagate through the system. Similarly, Object Relational Diagram shows how objects’ internal states affect each other. And then there are Control Flow Diagram, Entity Relationship Diagram, Data Flow Diagram, McCabe Logical Path, State Transition Diagram, and others. Those tools are not inexpensive, but they do generate data which make it possible to produce high-quality software. However, it is important to plan all the tests and analyses ahead of the time. It is easy to get mired in so many evaluations that the project’s cost and schedule suffer with little benefit to software quality.
The assumed probability of a software fault becomes a moot point. We should never plunge ahead releasing a code just because we’re satisfied that our statistical development model renders what we think is an acceptable probability of a failure. Instead, we must assume that every function may fail for whatever reason and take steps to ensure those failures are mitigated by the system architecture.
System architecture and software analyses can only be started upon determination that the requirements for the system are sufficiently robust. It is not unusual for a customer to insist on beginning development before signing the specification, which is often full of TBDs (i.e., “to be defined”). This may be leaving so many open issues that the design cannot and should not be started in earnest. Besides, development at such a stage is a violation of certification rules and will likely result in exceeding the budget and the schedule. Unfortunately, customers can’t or don’t always want to understand this and their pressure often prevails.
The ongoing desire to introduce software into the hardware paradigm is understandable. It could bring software development into a fully predictable scientific realm. So far it has been resisting those attempts, remaining to a large degree an art. Whether it can ever become a fully deterministic process, in my view, is doubtful. After all, every creative process is an art. But great strides have been made in development of tools, especially those for analyses, helping to make the process increasingly more predictable.