Bulletproofing Car Design
Formal tools can prove that a fault is safe, residual or latent—and they can quickly determine a worst-case/best-case diagnostic coverage without time-consuming analysis. In this article, Doug presents an improved formal verification flow that uses sequential logic equivalency checking (SLEC) to reduce the number of faults while simultaneously providing much higher quality of results.
The ISO 26262 automotive functional safety standard [1] requires performing random hardware failure analysis of electronic devices for automotive applications. The purpose is to test that a component’s safety functions and safety mechanisms work correctly. This is done after a design has been verified by randomly injecting faults and testing to see if the design still works correctly or if the faults are controlled or detected by a safety mechanism. This process of injecting faults and determining results is referred to as a fault campaign.
— ADVERTISMENT—
—Advertise Here—
Fault analysis has been around for years, and many methods have been created to optimize evaluation of hundreds of concurrent faults in specialized simulators. However, there are many challenges in running a fault campaign. In most designs, the number of faults to inject is very large. Simulating thousands of tests can lengthen project schedules and compete for resources. Creating a list of faults may not be trivial. Determining if tests activate and propagate faults can be labor intensive and difficult. Dealing with third party IPs and constraining them correctly under fault conditions may be tedious. Injecting multiple faults for latent failure analysis and using multiple failure modes (stuck-at0, stuck-at1, transient) only doubles and triples the effort. Lastly, gathering results for metric calculations typically requires some form of automation.
Fortunately, there is a better method for running a fault campaign—formal sequential logic equivalency checking (SLEC). Mentor, a Siemens Business, has developed a flow, called SafeCheck, which not only automates every step of the process, but also solves many issues associated with running a fault campaign. This formal platform-based flow allows engineers to focus on automotive functional safety goals, architecture, and requirements, without getting bogged down in the details of the fault campaign.
MINIMIZING FAULTS
As a starting point, an engineer uses a failure modes, effects, and diagnostic analysis (FMEDA) or creates a simple spreadsheet (CSV file) listing the safety functions, their safety mechanisms (both primary and secondary), any constraints to apply (such as disabling DFT when testing) and the fault detection interval (FDI) [2] required by the design’s hardware requirements.
— ADVERTISMENT—
—Advertise Here—
The sheer number of potential faults in a design makes the fault campaign process overwhelming. Using a formal tool to prune the number of faults separates the wheat from the chaff and makes the process more manageable. Because formal tools perform their own synthesis of the design into a Boolean representation for formal analysis, they have the ability to trace back the cone of influence (COI) of a signal by using the cone of influence command:
formal generate coi -var
<signal_name>
A cone of influence includes all the logic that has the possibility to affect the output signal. Anything outside of the cone is by definition irrelevant, since logically it cannot affect the cone. In other words, logic outside of the cone of influence can automatically be classified as safe without the need for further analysis.
Overlapping the cone of the safety function with the safety mechanism reveals what safety related logic is covered by the safety mechanism, or, in other words, what logic should be treated as a single-point fault or a residual fault (or latent faults when overlapping cones of safety mechanisms). Safety functions and mechanisms, even on large designs, may only consist of a small number of potential faults, which makes formal analysis possible even on large ICs. We’ll return to single-point faults and residual faults later in this article when we talk about analyzing faults with SLEC.
INJECTING A FAULT
Once faults are determined using the cone of influence, the faults in the cone are randomly injected. Typical failure modes considered in a fault campaign are stuck-at0, stuck-at1 and transient faults. With formal, simple constant constraints can be specified, as with synthesis tools, causing logic to be pruned away as logic is removed. Likewise, nets and registers can be cut, also known as cutpoints. Here is an example of a cutpoint directive:
— ADVERTISMENT—
—Advertise Here—
netlist cutpoint {dut.p_wbspi.clgen.cnt[8]}
Any signal not controlled by a driver becomes a formal control point, which formal can drive to any value (note: formal tools do not typically drive unknowns.) These cutpoints can introduce faults since the formal tool can drive either stuck-at or transient values onto them. Unfortunately, constants and cutpoints are created at compile time, which would require many compilations to add the potentially thousands and thousands of faults to inject.
Alternatively, a formal tool can create conditional cutpoints. For example:
netlist cutpoint {dut.p_wbspi.clgen.cnt[8]} -cond { fault == 100 } \-driver 0
Conditional cutpoints are specified for every possible fault in the design; however, the signal is only cut when the condition is true. This allows formal to run any number of targets, cut the appropriate signal to inject the fault, and require only one formal compilation.
ANATOMY OF A FAULT INJECTOR
The next step is to control the condition used in the cutpoint. The way to give formal control of a signal is to have no driver for it. This is referred to as a non-deterministic (ND) input since the formal engines can assign it to any value. Care must be taken to ensure that only one fault is selected per formal evaluation, otherwise multiple faults will be injected by the formal tool. Simple assumptions can constrain the fault selected:
bit [N-1:0] fault;
...
asm_stable_fault: assume property ( $stable(fault) );
Not all failure modes, however, call for a fault to be continuously stuck-at, and some requirements may even need to have the stuck-at faults injected at some time after reset. To accommodate this, a second ND-input is created to indicate when the fault should be injected:
bit nd_inject;
Getting assumptions to cover all cases can be a little tricky, so some straightforward SystemVerilog code is used to register when formal chooses to inject the fault, which makes writing cutpoint conditions, as well as other properties, easier:
bit injected;
always @($global_clock)
injected = injected || nd_inject;
Notice that $global_clock is used here since formal automatically infers the clock and reset, which is handy when automating a fault injection flow. If the clock and reset are known, then it would be best to use those instead. Now the formal assumption becomes:
asm_stable_fault: assume property ( injected |-> $stable(fault) );
And the conditional cutpoint is:
netlist cutpoint { dut.p_wbspi.clgen.cnt[8] } \
-cond { injected && fault == 100 } \
-driver ‘sb0
This method works for stuck-at faults as the injected signal remains asserted once the fault is injected and the cutpoint remains active from that point onwards. For transient faults, however, the cutpoint should only be active for a limited number of cycles. Instead of using injected, we can create another inject signal and constrain it for use with transient faults such that it creates a one-time pulse of a specified width:
netlist cutpoint { dut.p_wbspi.clgen.cnt[8] } \
-cond { inject && fault == 100 } \
-driver ‘sb0
In this example, there is no reset on the always block so formal may try to inject faults during reset:
bit nd_inject;
bit inject;
bit injected;
always @($global_clock) begin
inject = nd_inject;
injected = injected || inject;
end
// --------------------------------------
// Transient faults
// --------------------------------------
asm_inject_width : assume property ( $rose(inject) |-> inject[*WIDTH] ##1!inject );
asm_inject_stable : assume property ( $rose(inject) |-> $past(!injected));
Coding the always block with a reset solves this problem, or it can be solved using a special type of assumption—one that is called inside of an initial block. This forces the assumption to be applied only once formal evaluation begins, which happens when reset is released, and has the added benefit of indicating when a transient fault should be injected. The initial assumption could look like this:
property prop_init_seq;
( !inject[*START_MIN:START_MAX] ##1 inject ) and
( !injected[*START_MIN:START_MAX] ##1 injected );
endproperty
initial begin : INIT_SEQ
asm_init_seq: assume property ( prop_init_seq );end
Once the fault injector is created, it should be bound into the design or compiled in as a second top module.
PROPAGATING FAULTS WITH SLEC
With formal tools, stimulus generation is automatic. The difficulty comes in holding it back from using illegal or undefined input values. Usually, this involves writing lots of formal constraints and debugging lots of false negatives.
However, with formal SLEC, constraining the design becomes automatic. Unlike logic equivalency checking used to determine RTL-to-gate equivalency, which checks that the logic between registers is equivalent, SLEC uses formal methods to clock the design and determine if two entire designs are functionally equivalent. In other words, the functionality can be implemented in totally different ways with different timing, but the outputs remain the same. SLEC has many applications, such as checking that a design ported from VHDL to Verilog remains the same, or that a new feature does not break existing functionality.
With fault injection, a design can be compared with itself, with one instance containing the fault. If the two designs produce equivalent outputs even when a fault is injected, then clearly the fault does not affect the design; in other words, it can be classified as safe. SLEC automatically ties the inputs of the two instances together, so in essence the original design becomes the reference model that constrains the fault injected model. Formal constraints are no longer required and the entire legal state space can be explored without additional user input.
SLEC targets equivalency points, which are automatically inferred based on path names, so no manual intervention is needed when two instances of the same design are compared. If the outputs form the safety function, then the safety paths will automatically be checked for equivalency. If the safety function is a combination of signals or need qualified (like with a strobe or enable signal), an additional signal can be defined in SystemVerilog and a SLEC map can be created using a SLEC directive:
slec map { slec_wrapper.spec.top.safety_func } \
{ slec_wrapper.impl.top.safety_func }
In this example, the slec_wrapper is a top level wrapper created automatically by the Questa SLEC App. spec represents the specification or unmodified instance, and impl refers to the implementation or fault injected model.
SLEC mappings create SLEC targets, but it can fire from any fault being injected into the design. For ISO 26262 metric calculations, each fault must be classified so specific targets are required for each fault. Adding a condition to the SLEC mapping, as well as giving it a name, allows tracking of the fault results:
slec map { slec_wrapper.spec.top.safety_func } \
{ slec_wrapper.impl.top.safety_func } \
-name SLEC_safe_100_sa0 \
-cond {slec_wrapper.impl.fi.fault == 100 && \
slec_wrapper.impl.fi.injected }
The fi instance and injected signal come from the fault injector described in the previous section. In this example, if target SLEC_safe_100_sa0 fires, it means that formal SLEC has injected fault 100, which corresponds to a conditional cutpoint. A SLEC firing means that the two designs are not equivalent. A proof means that they are. In other words, a proof demonstrates that the fault is safe (Figure 1); whereas a firing demonstrates that the fault can propagate and that more analysis is required to see if it can be detected by a safety mechanism.
Proven SLEC results showing that the injected fault is safe—in other words, it does not affect the safety outputs
ANALYZING FAULTS
Injecting faults with SLEC to see if a fault can propagate is referred to as safe fault analysis in the Mentor SafeCheck flow. Simple SLEC mappings, as shown previously in the article, typically evaluate quickly, and the Questa SLEC App is optimized to find equivalency points and collapse logic so large designs and state spaces can be handled efficiently. However, proving that a fault can propagate does not prove that it can be detected.
By definition, safety related logic not covered by any intended safety mechanism is considered a single-point fault (SPF). SPFs are easy to find using nothing more than a structural cone of logic. For everything else protected by a safety mechanism, the cones of influence can be overlapped to see what is covered and what is not. Anything not in the cone of the safety mechanism is structurally a residual fault. Even so, there is no guarantee that a fault outside of a safety mechanism cone can propagate, which is where safety fault analysis comes in. For all faults within the cone of influence of a safety mechanism, residual fault analysis needs to be performed. With this analysis, not only are we looking for a violation of the safety goal, but we are also looking to see if the fault is detected by the safety mechanism within a specified time window, called the fault detection interval [2] (Figure 2). Note: this differs from the fault tolerant time interval (FTTI), which is a system-level time interval, usually on the order of microseconds.
While simple SLEC mapping targets could be used and the results crossed, they do not adequately capture the residual fault scenario. SLEC targets by themselves do not include the concept of a window in time, as shown in Figure 2, and once a SLEC violation fires, any assumptions or subsequent behavior is ignored. Instead, SLEC supports SystemVerilog Assertions (SVA) properties, where any specific behavior can be described. So, for residual fault analysis, a simple cover property will suffice, along with some SystemVerilog code to simplify the writing of the property:
bit violation;
bit detection;
default clocking cb @($global_clock); endclocking
always @($global_clock) begin
...
if ( injected && (safety_function_violated)) violation <= 1;
safety_mechanism_expression
if ( injected && ()) detection <= 1;
end
cov_res: cover property ( fault == 100 && violation ##[FDI] !detection );
In this example, the time window is specified by the parameter FDI. Technically, the ISO standard defines the FDI as the time from when the fault is injected until it is detected, but practically speaking, it makes more sense to treat it as the time from the violation to the point of detection, since propagation delays through the design will greatly vary from path to path. If formal finds a way to cover the property where the safety function is violated and the fault goes undetected, then a residual fault has been detected. If the property is uncoverable, then it will be treated as a dual-point fault (DPF) for later analysis of latent faults.
Latent fault analysis involves injecting two faults into the analysis—one in the safety function known to be detectable and the other in the primary safety mechanism. If a design has only one safety mechanism, then all state bits in the primary safety mechanism are by definition latent; i.e., faults in the safety mechanism cannot be detected. Most designs will have a secondary mechanism, such as power-on self-test, which can find stuck-at faults in the primary safety mechanism. (Note: the ISO standard does not require transient fault analysis for latent faults.)
Crossing safety function faults with safety mechanism faults increases the number of testable combinations exponentially. With formal, however, adding more faults reduces analysis. Since a fault is created by cutting a signal, adding faults creates additional control points for formal. The more control points, the more degrees of freedom formal has, making it easier to find a solution for the target under analysis.
By constraining formal to use only DPFs in the safety function, latent fault SLEC targets can be written similar to residual fault targets with just one modification. We will create a fault ID variable for the safety mechanism called sm_fault and rename the fault signal as sf_fault for the safety function fault ID. Then the latent fault target becomes as follows:
always @($global_clock) begin
safety_function_violated
...
if ( injected && ())
primary_safety_mechanism_expression ||
violation <= 1;
if ( injected && (
secondary_safety_mechanism_expression))
detection <= 1;
end
cov_latent: cover property ( sm_fault == 32 && violation ##[MPFDI]!detection );
An additional conditional cutpoint is added for the second fault in the primary safety mechanism:
netlist cutpoint { dut.err_h0.spi_error } \
-cond { injected && sm_fault == 32 } \
-driver ‘sb0
SVA assumptions are added to constrain the analysis to only two faults, using only known DPFs:
asm_dpfs: assume property ( sf_fault inside{[1:709],[854:855],[1387:1394],[1430:1431]} );
asm_stable_sm_fault: assume property ( injected |-> $stable(sm_fault);
An advantage of this approach is that formal automatically chooses the DPF to use, and the appropriate fault is injected by the conditional cutpoints used during residual fault analysis.
An injected fault will have an effect down multiple paths within a design. Suppose a design has two ports, A and B. When a fault has no effect on port A, this implies it is a safe fault. However, the same fault may propagate down path B and go undetected, making it residual. In an IC, which is likely to have shared logic, prioritizing the fault classification is important. In this example, since there exists a path (to port B) where the fault is not detected, the fault should be classified as residual, not safe, even though it had no effect on port A. The fault priorities can be inferred from the flowchart found in figure B.2 of ISO26262-5:2018.
Once all the formal SLEC targets are evaluated, the results can be summed up in a report, as seen in Figure 3. More detailed reports can be created to evaluate missing safety coverage in the design. Any faults whose properties could not converge in formal can be exported as a fault list and used in a simulation or emulation regression to complete the safety metrics.
BEYOND FORMAL
Perhaps the biggest advantage of using a formal-based flow is that very quickly worst-case/best-case diagnostic coverage can be determined based on nothing more than structural analysis. As formal is able to verify results, the diagnostic coverage range narrows and confidence in the results increases.
Unlike other methods, when formal proves a fault is either safe, residual, or latent it is absolute and exhaustive. Anything formal cannot prove can be punted over to simulation and emulation. Likewise, software safety mechanisms are not an ideal candidate for formal, so these need another solution, such as emulation. In any case, formal easily reduces the fault analysis effort by pruning the fault lists and focusing on the cone of influence. With little more than a few inputs describing the safety functions and mechanisms, the entire formal flow is automated. Using this approach is really the best method for proving something is functionally safe.
For detailed article references and additional resources go to:
www.circuitcellar.com/article-materials
References [1] and [2] as marked in the article can be found there.
RESOURCES
Mentor, a Siemens Company | www.mentor.com
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • MARCH 2019 #344 – Get a PDF of the issue
Sponsor this ArticleDoug Smith is a functional verification consultant for Mentor, a Siemens Company, with expertise in UVM and formal technologies. Doug holds a master’s degree in Computer Engineering from the University of Cincinnati and a bachelor’s degree in Physics from Northern Kentucky University.