System safety assessment provides a standard, generic method to identify, classify, and mitigate hazards. It is an extension of failure mode effects and criticality analysis and fault-tree analysis that is necessary for embedded controller specification.
System safety assessment was originally called ”system hazard analysis.” The name change was probably due to the system safety assessment’s positive-sounding connotation.
I participated in design reviews where failure effect classification (e.g., hazardous, catastrophic, etc.) had to be expunged from our engineering presentations and replaced with something more positive (e.g., “issues“ instead of “problems”), lest we wanted to risk the wrath of buyers and program managers.
System safety assessment is in many ways similar to a failure mode effects and criticality analysis (FMECA) and fault-tree analysis (FTA), which I described in “Failure Mode and Criticality Analysis” (Circuit Cellar 270, 2013). However, with safety assessment, all possible system faults—including human error, electrical and mechanical subsystems’ faults, materials, and even manuals—should be analyzed. The impact of their faults and errors on the system safety must also be considered. The system hazard analysis then becomes a basis for subsystems’ specifications.
Performing FMECA and FTA on your subsystem ensures all its potential faults become detected and identified. The faults’ signatures can be stored in a nonvolatile memory or communicated to a display console, but you cannot choose how the controller should respond to any one of those faults. You need the system hazard analysis to tell you what corrective action to take. The subsystem may have to revert to manual control, switch to another control channel, or enter a degraded performance mode. If you are not the system designer, you have little or no visibility of the faults’ potential impact on the system safety.
For example, an automobile consists of many subsystems (e.g., propulsion, steering, braking, entertainment, etc.). The propulsion subsystem comprises engine, transmission, fuel delivery, and possibly other subsystems. A part of the engine subsystem may include a full-authority digital engine controller (FADEC).
Engine controllers were originally mainly mechanical devices, but with the arrival of the microprocessor, they have become highly sophisticated electronic controllers. Currently, most engines—including aircraft, marine, automotive, or utility (e.g., portable electrical generator turbines)—are controlled by some sort of a FADEC to achieve best performance and safety. A FADEC monitors the engine performance and controls the fuel flow via servomotor valves or stepper motors in response to the commanded thrust plus numerous operating conditions (e.g., atmospheric and internal pressures, external and internal engine temperatures in several locations, speed, load, etc.).
The safety assessment mostly depends on where and how essentially identical systems are being used. A car’s engine failure, for example, may be nothing more than a nuisance with little safety impact, while an aircraft engine failure could be catastrophic. Conversely, an aircraft nosewheel steer-by-wire can be automatically disconnected upon a fault. And, with a little increase of the pilots’ workload, it may be substituted by differential braking or thrust to control the plane on the ground. A similar failure of an automotive steer-by-wire system could be catastrophic for a car barreling down the freeway at 70 mph.
System safety analysis comprises the following steps: identify and classify potential hazards and associated avoidance requirements, translate safety requirements into engineering requirements, design assessment and trade-off support to the ongoing design, assess the design’s relative compliance to requirements and document findings, direct and monitor specialized safety testing, and monitor and review test and field issues for safety trends.
The first step in hazard analysis is to identify and classify all the potential system failures. FMECA and FTA provide the necessary data. Table 1 shows an example and explains how the failure class is determined.
|No safety impact.||Does not significantly reduce system safety. Required actions are within the operator’s capabilities.||Reduces the capability of the system or operators to cope with adverse operating conditions. Can cause major illness, injury, or property damage.||Significantly reduces the capability of the system or the operator’s ability to cope with adverse conditions to the extent of causing serious or fatal injury to several people.||Total loss of system control resulting in equipment loss and/or multiple fatalities.|
The next step determines each system-level failure’s frequency occurrence (see Table 2). This data comes from the failure rates calculated in the course of the reliability prediction, which I covered in my two-part article “Product Reliability” (Circuit Cellar 268–269, 2012) and in “Quality and Reliability in Design” (Circuit Cellar 272, 2013).
|Frequent||Probability of occurrence per operation is equal or greater than 1 × 10–3|
|Probable||Probability of occurrence per operation is less than 1 × 10–3 or greater than 1 × 10–5|
|Occasional||Probability of occurrence per operation is less than 1 × 10–5 or greater than 1 × 10–7|
|Remote||Probability of occurrence per operation is less than 1 × 10–7 or greater than 1 × 10–9|
|Improbable||Probability of occurrence per operation is less than 1 × 10–9|
Based on the two tables, the predictive risk assessment matrix for every hazardous situation is created (see Table 3). The matrix is a composite of severity and likelihood and can be subsequently classified as low, medium, serious, or high. It is the system designer’s responsibility to evaluate the potential risk—usually with regard to some regulatory requirements—to specify the maximum hazard levels acceptable for every subsystem. The subsystems’ developers must comply with and satisfy their respective specifications. Electronic controllers in safety-critical applications must present low risk due to their subsystem fault.
|Probability / Severity||Catastrophic (1)||Critical (2)||Marginal (3)||Negligible (4)|
The system safety assessment includes both software and hardware. For aircraft systems, the required risk level determines the development and quality assurance processes as anchored in DO-178 Software Considerations in Airborne Systems and Equipment Certification and DO-254 Design Assurance Guidance for Airborne Electronic Hardware.
Some non-aerospace industries also use these two standards; others may have their own. Figure 1 shows a typical system development process to achieve system safety.
The common automobile power steering is, by design, inherently low risk, as it continues to steer even if the hydraulics fail. Similarly, some aircraft controls continue to be the old-fashioned cables but, like the car steering, with power augmentation. If the power fails, you just need more muscle. This is not the case with the more prevalent drive- or fly-by-wire systems.
How can the risk be mitigated to at least 109 probability for catastrophic events? The answer is redundancy. A well-designed electronic control channel can achieve about 105 probability of a single fault. That’s it. However, the FTA shows that by ANDing two such processing channels, the resulting failure probability will decrease to 1010, thus mitigating the risk to an acceptable level. An event with 109 probability of occurring is, for many systems, acceptable as just about “never happening,” but there are requirements for 1014 or even lower probability. Increasing redundancy will enable you to satisfy the specification.
Once I saw a controller comprising three independent redundant computers, with each computer also being triple redundant. Increasing safety by redundancy is why there are at least two engines on every commercial passenger carrying aircraft, two pilots, two independent hydraulic systems, two or more redundant controllers, power supplies, and so forth.
Human engineering, to use military terminology, is not the least important for safety and sometimes not given sufficient attention. MIL-STD-1472F, the US Department of Defense’s Design Criteria Standard: Human Engineering, spells out many requirements and design constraints to make equipment operation and handling safe. This applies to everything, not just electrical devices.
For example, it defines the minimum size of controls if they may be operated with gloves, the maximum weight of equipment to be located above a certain height, the connectors’ location, and so forth. In my view, every engineer should look at this interesting standard.
Non-military equipment that requires some type of certification (e.g., most electrical appliances) is usually fine in terms of human engineering. Although there may not be a specific standard guiding its design in this respect, experienced certificating examiners will point out many shortcomings. But there are more than enough fancy and expensive products on the market, which makes you wonder if the designer ever tried to use the product himself.
By putting a little thought beyond just the functional design, you can make your product attractive, easy to operate, and safe. It may be as simple as asking a few people who are not involved with your design to use the product before you release it to production.