Stress and Statistics
It’s a fact of life that every electronic system eventually fails. Manufacturers use various methods to weed out most of the initial failures before shipping their product. Here, George discusses engineering attempts to bring some predictability into the reliability and life expectancy of electronic systems. In particular, he focuses on Highly Accelerated Lifetime Testing (HALT) and Highly Accelerated Stress Screening (HASS).
The British prime minister Disraeli once opined that there were lies, damned lies and statistics. That said, statistics are the driving force behind our efforts to determine a product’s reliability and its life expectancy. Statistics is a science in its own right. It works. But my article merely describes the application. It has no ambition to be a tutorial of statistical analysis. While I’m focusing here on predictability in electronic systems, the same principles apply to all systems, electronic or otherwise, so you can extend such engineering practice to other areas of product reliability too.
Product lifecycle is illustrated by the ubiquitous bathtub curve in Figure 1—I’ve shown this a few times in my previous articles. Just to quickly review its meaning: every product suffers from initial failures—often called infant mortality—shown by the blue line. Infant mortality failures are caused by components’ and products’ weaknesses, manufacturing process-induced stress, handling, assembly or by already existing flaws in the raw materials. Eventually, the initial failures become patent and with time diminish to zero.

Manufacturers use various methods to weed out most of the initial failures before shipping their product. These range from a simple burn-in, which may be nothing more than keeping the product powered up for some time, to a full environmental stress screening (ESS) ([1] and [2]), or some combination in between. The major shortcoming of a simple burn-in is that it merely runs the product, sometimes at an elevated temperature. It may uncover drift-of-calibration problems, but will rarely induce a failure.
STRESS SCREENING
The challenge in eliminating initial failures is knowing how long and at what stress level the screening should be performed. Screening is already expensive in terms of the required labor, equipment and the quantity of finished product delayed before shipping. If you perform insufficiently stressful ESS you’ll end up with a potentially large number of defective product escapes. Conversely, if you toughen the test procedure its cost will increase, while too much stress will age the product and reduce its useful life—even causing destruction by overstress. To illustrate the point, consider how ridiculous it would be to drive a newly manufactured car for 10,000 miles in an extreme environment in order to declare it reliable and ready for shipping. We’ll return to the stress screening later in this article.
To put the scales of the bathtub into perspective, the ESS testing, depending on the product, may take days, sometimes a week or more. Meanwhile, the normal life expectancy may be in excess of twenty years for safety-critical products, or just three to five years for the consumer ones.
Once out of the infant mortality phase, products exhibit a stress-related, constant level of failures, shown by the orange trace. This is predicted by statistical analysis [3] and expressed as a Mean Time Between Failures (MTBF) for repairable products or Mean Time To Failure (MTTF) for non-repairable ones. System designers specify the required MTBF/MTTF to the component design engineers based on the system application. One should always keep in mind, however, that the calculated MTBF/MTTF is not a precise number, and it should be kept away from the bean counters. It’s solely a statistical probability, usually following a normal probability distribution curve shown in Figure 2. The end of the useful life is the wear-out period shown by the violet trace in Figure 1. The resulting lifecycle curve represented by the red trace is the sum of the three effects.
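The way the three effects add up can be sketched numerically. The Python fragment below is only an illustration, not the data behind Figure 1 or Figure 2. It assumes Weibull-shaped hazard terms for infant mortality and wear-out and a constant rate of 1/MTBF for the flat region, which is one common textbook model. Every parameter value is invented for the example.

```python
import math

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t, mtbf_hours=50_000.0):
    """Failure rate per hour as the sum of three effects (all values illustrative)."""
    infant = weibull_hazard(t, shape=0.5, scale=200_000.0)    # shape < 1: decreasing
    random_failures = 1.0 / mtbf_hours                         # constant rate
    wear_out = weibull_hazard(t, shape=5.0, scale=150_000.0)   # shape > 1: increasing
    return infant + random_failures + wear_out

def survival_constant_rate(t, mtbf_hours):
    """Fraction of units still working at time t, assuming a constant failure rate."""
    return math.exp(-t / mtbf_hours)
```

With these made-up numbers the failure rate falls between 100 and 10,000 hours and rises again by 150,000 hours. Under the constant-rate assumption, `survival_constant_rate(50_000, 50_000)` evaluates to e^-1, about 0.37. In other words, only about 37% of units would still be running at t = MTBF, which is one reason the figure should not be read as a guaranteed lifetime.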
Up to this point we have focused on the output of the manufacturing, where the ESS uncovers defects originated during the manufacturing process or due to defective components. It is, however, imperative that the robustness of the product is established during its design and development. To that end, the product is subjected to all kinds of electrical and environmental qualification stresses. Unfortunately, these tests, though successful, do not always provide assurance of satisfactory real-life performance. While the product may successfully pass many individual tests at high levels of stress, it may fail when exposed to a combination of stresses at significantly lower levels. That’s where endurance testing comes in.
ADVANTAGES OF HALT
As you can imagine, high endurance testing takes time and is expensive. We have all seen commercials with, for example, doors being continually opened and closed by a motor to prove the quality of their hinges to us. But that’s little more than a joke. What does the industry do?
Enter Highly Accelerated Lifetime Testing (HALT). The fundamental principles of lifetime testing were established decades ago by military programs such as Reliability Development/Growth Test (RDGT). In a nutshell, the operating unit under test (UUT) is subjected to increasing levels of stress until it fails. The failure is analyzed and the design modified to eliminate the weakness. After the fix the test continues, increasing the stress until a new failure occurs. It is repaired and round and round we go until the desirable level of robustness is achieved.
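The stress, fail, fix and repeat cycle described above can be sketched as a simple loop. This is a toy model under stated assumptions, not a real test procedure: the function name, the single "stress level" number and the failure-mode names are all hypothetical, and "fixing" a weakness is modeled by simply removing it from the unit.

```python
def halt_step_stress(weak_points, start, step, target):
    """Raise stress in steps; when a weak point fails, 'fix' it and continue.

    weak_points maps a hypothetical failure-mode name to the stress level
    (arbitrary units) at which it fails. Returns a list of
    (stress_level, fixed_mode) events in the order they occurred.
    """
    remaining = dict(weak_points)
    fixes = []
    stress = start
    while stress <= target:
        failing = [name for name, limit in remaining.items() if limit <= stress]
        if failing:
            # Analyze the failure and modify the design. Here that is
            # modeled by deleting the weakest remaining failure mode.
            mode = min(failing, key=remaining.get)
            fixes.append((stress, mode))
            del remaining[mode]
            continue  # retest at the same stress level after the fix
        stress += step
    return fixes

# Illustrative run with invented failure modes and limits.
events = halt_step_stress(
    {"solder joint": 70, "connector": 95, "capacitor": 140},
    start=40, step=10, target=120,
)
```

In this run the solder joint is fixed at level 70 and the connector at level 100. The capacitor, whose invented limit of 140 lies above the target of 120, never fails and is left alone, which corresponds to stopping once the desired robustness is reached.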
One major advantage of HALT over the old RDGT is that the lifecycle test duration can be slashed from months, even years, down to days and perhaps hours. Such time compression is accomplished thanks to the new breed of environmental test chambers. Thermotron [4] is a major manufacturer of such chambers. While the old units contained a vibration platform with maybe three degrees of freedom (3 DOF) at best, the current chambers feature 6 DOF. The rate of temperature change used to be around 8°C/minute (14.4°F/minute), but Thermotron’s HALT chamber, Figure 3, boasts a whopping 70°C/minute (126°F/minute) rate at a minimum.
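To get a feel for what the faster ramp rate means, the arithmetic can be written out. This is a back-of-the-envelope sketch: the temperature swing of -40°C to +100°C is an assumed example, not a figure from any chamber's specification, and the calculation ignores the thermal mass of the product itself.

```python
def ramp_minutes(t_low_c, t_high_c, rate_c_per_min):
    """Minutes needed to traverse a temperature range at a constant ramp rate."""
    return (t_high_c - t_low_c) / rate_c_per_min

def c_rate_to_f_rate(rate_c_per_min):
    """Convert a ramp rate from degrees C/minute to degrees F/minute.

    A temperature difference scales by 9/5 only; the +32 offset does not apply.
    """
    return rate_c_per_min * 9.0 / 5.0

old_ramp = ramp_minutes(-40.0, 100.0, 8.0)    # assumed swing, older chamber rate
halt_ramp = ramp_minutes(-40.0, 100.0, 70.0)  # assumed swing, HALT chamber rate
```

For the assumed swing, the older 8°C/minute rate needs 17.5 minutes per ramp, while 70°C/minute needs 2 minutes. Over the many thermal cycles of a test, that difference accounts for much of the time compression.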
This chamber, in addition to the extreme temperature stress, can vibrate the UUT randomly over a broad frequency spectrum and random directions to discover hidden mechanical resonances. In other words, the simultaneous stresses brought about by HALT are likely to precipitate all latent faults and make them patent.
Figure 4a and Figure 4b show what happens during the HALT. Similar to the RDGT, tests are carried out at increasing stress levels while identifying and repairing weak spots. By eliminating weak spots, we want to achieve the point where the component failures become random. Remember though, that before starting HALT, the specification must be correct and final—not in need of modifications. After the HALT the product specification must remain frozen. By then the product has gained healthy operating margins as shown by Figure 4b. How much of an improvement of the operating margin can be achieved depends on the specific product.
HASS FOR PRODUCTION TEST
Once the development HALT has been performed, manufacturing ESS can be replaced by faster, less demanding Highly Accelerated Stress Screening (HASS). Just like ESS, HASS will precipitate latent faults and improve reliability of shipped product, because HALT has already provided increased operating margins. Keep in mind that HASS is a production test, so it must not exert the stress levels of HALT, which would weaken the product before shipment. It is recommended that once HASS levels have been established, a production unit is run through HASS twenty times in sequence to verify that, after that much abnormal stress, the unit still shows no acquired weakness.
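One common way to read the twenty-pass check is through a linear cumulative-damage assumption: if a unit survives N passes, a single pass can have consumed no more than 1/N of its life. That interpretation is an assumption added here, not something the procedure itself states. In the sketch below, `run_hass_pass` is a hypothetical callback standing in for one HASS cycle followed by a functional test.

```python
def max_life_fraction_per_pass(passes_survived):
    """Upper bound on life consumed by one pass, assuming linear damage."""
    if passes_survived < 1:
        raise ValueError("passes_survived must be at least 1")
    return 1.0 / passes_survived

def proof_of_screen(run_hass_pass, passes=20):
    """Run the screen repeatedly on one unit.

    run_hass_pass(n) is a hypothetical callable that performs pass number n
    and returns True if the unit still meets its functional test afterwards.
    Returns None if all passes succeed, else the pass at which a weakness appeared.
    """
    for n in range(1, passes + 1):
        if not run_hass_pass(n):
            return n
    return None
```

Under the linear assumption, surviving twenty passes bounds the life consumed by the single pass each shipped unit receives at 5%. If `proof_of_screen` returns a pass number instead of `None`, the screen levels are too harsh and need to be reduced.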
In high-volume production it often becomes desirable to run HASS on production samples only. Before this can be done, the production processes must be up to standard and must not reduce the design’s ruggedness. Early production usually begins with 100% HASS. Only when the production process is evidently under full control may HASS sampling be gradually introduced. At that time a process called Highly Accelerated Stress Audit (HASA) is implemented.
Clearly, quality product design and manufacturing are to a large degree governed by statistics. Most design engineers do not have the training in statistical analysis of a caliber required for the development of HALT, HASS or HASA procedures. You should leave it to experts to develop statistically relevant, yet also minimum, sample sizes. One might worry, for example, that if a supplier began delivering components with lower quality, substandard product could escape into the field. Still, if the sampling process is well tuned to the production rate, the problem will be quickly discovered. That’s because HASS stressing is well above the levels of normal product use. With high operating margin it is unlikely such problems could spill into the field.
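The link between sample size and the chance of catching a problem can be illustrated with the simplest possible model. It is a sketch only and no substitute for a properly designed sampling plan: it assumes units are defective independently with a fixed probability and that HASS exposes every defective unit in the sample. The function names are made up for the example.

```python
import math

def detection_probability(defect_rate, sample_size):
    """Chance that a sample contains at least one defective unit."""
    return 1.0 - (1.0 - defect_rate) ** sample_size

def sample_size_for(defect_rate, confidence):
    """Smallest sample giving at least `confidence` chance of one detection."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - defect_rate))

n = sample_size_for(defect_rate=0.05, confidence=0.90)
```

Under these assumptions, catching a 5% defect rate with 90% confidence takes a sample of 45 units, and the required sample grows quickly as the defect rate to be detected shrinks. Real plans must also account for lot sizes, production rate and imperfect detection, which is where the expert comes in.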
Statistics are a specialized, scientific field. Too many errors with serious repercussions have been caused by people with only a basic training. If all you intend to do is to perform 100% HASS, base-level statisticians should be able to handle it. For production sampling and HASA, engage a well-trained expert!
RESOURCES
References:
[1] Circuit Cellar 255, October 2011, George Novacek: Environmental Stress Screening
[2] U.S. Department of Defense Military Handbook “Environmental Stress Screening (ESS) of Electronic Equipment,” MIL-HDBK-344A http://www.barringer1.com/mil_files/MIL-HDBK-344A.pdf
[3] Circuit Cellar 272, March 2013, George Novacek: Quality and Reliability in Design
[4] Thermotron http://thermotron.com/
Thermotron | www.thermotron.com
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • NOVEMBER 2018 #340
George Novacek was a retired president of an aerospace company. He was a professional engineer with degrees in Automation and Cybernetics. George’s dissertation project was a design of a portable ECG (electrocardiograph) with wireless interface. George has contributed articles to Circuit Cellar since 1999, penning over 120 articles over the years. George passed away in January 2019. But we are grateful to be able to share with you several articles he left with us to be published.