British Prime Minister Benjamin Disraeli (1804–1881) once uttered: “There are lies, damned lies, and statistics.” I don’t say statistics lie, but not everything presented to us as a result of statistical analysis is necessarily true. Statistics are instrumental for investigation of many scientific and social subjects, but they’re just a tool. Incorrectly used, their results can be wittingly or unwittingly skewed and potentially used to prove or disprove just about anything.
One use of statistics in engineering is to predict product reliability, a topic I’ve addressed in several of my previous columns. In this article, I’ll investigate it further.
Predicted reliability is a probability of a product functioning without a failure for a given length of time. It is usually presented as a failure rate λ (Greek letter lambda) indicating the probable number of failures per million hours of operation. Its reciprocal Mean Time Between Failures (MTBF) or Mean Time to Failure (MTTF for irreparable products) is often preferred because it is easier to comprehend.
Having determined a product’s reliability, we can establish its criticality, the likely warranty and maintenance costs as well as to plan repair activities with spare parts quantities and their allocation. Figure 1 is the ubiquitous reliability bath tub curve. The subject of our discussion is the period called “Constant Failure Rate.”
Figure 1: Reliability bath tub curve.
Predicted reliability is not a precise number. It is a probability with less than 100% certainty that the product will work for the specified time. Unfortunately, many buyers, program managers, even design engineers do not recognize this and expect the predicted failure rate to be a certainty.
Most engineers aren’t expert statisticians, nor can every design organization afford such specialists on staff. Historical data to perform reliability prediction are rarely adequate in small companies. Luckily, the US military made the reliability prediction easy to calculate by following their Military Handbook: Reliability of Electronic Equipment (MIL-HDBK-217), now in version F (www.sre.org/pubs/Mil-Hdbk-217F.pdf), based on data collected over many years.
The general method for calculating failure rate during the development stages is by parts count which, once all design details are known, is refined by parts’ stress analysis.
With the parts count method, the predicted failure rate is:
In this equation, λEQUIP is the total equipment failures per million hours. λg is the generic failure rate for ith generic part. πQ is the quality factor for ith generic part. Ni quantity of ith generic part. n is the number of different generic part categories in the equipment. Using a spreadsheet, for example, you follow the Military Handbook by calculating failure rates for individual components.
A typical component failure rate model is:
In this equation, λp is the part’s failure rate. λb is the part’s base failure rate. p are factors modifying the base rate for stress, application, environment, and so on.
The Military Handbook provides tables with statistical data for components’ base failure rate and all the pertinent p factors. It is one of several methods for calculating predicted reliability. I was interested to see how some different statistical methods compare with each other. For my inquiry I selected an Arduino Uno, which many readers are familiar with (see Photo 1). Not having the Arduino’s design details, I estimated operating conditions. Because I used the same estimates for all the calculations, the relative comparison of the results is valid, while the absolute value may be somewhat off.
Photo 1: Arduino Uno
I have a number of Arduino boards, Duemilanove and Uno, which are quite similar, and through the years have collectively clocked over 600,000 hours of continuous operation without a single failure (including all their peripheral circuits which are not, however, included in my calculations). Several of my Arduinos have worked in the garage with the recorded ambient temperature ranging from –32°C (–25.6°F) to 39°C (102.2°F), which falls under the military category “ground fixed” (GF) environment. I used this for my calculations. While components’ failure rates generally increase exponentially with temperature, operation below about 50°C (122°F) does not significantly increase the stress over a room temperature, so there would not be a significant difference in failure rate between those Arduinos working in the garage and those inside the house.
I performed two manual calculations according to MIL-HDBK-217F and one per HDBK 217 Plus using Mathcad. The failure rate of Item 1 in Table 1 was calculated per MIL-HDBK-217F for commercial parts. The MIL handbook has been criticized for two shortcomings. First, for the unrealistic penalty on commercial parts by today’s quality standards as compared to the military-screened parts. This is understandable in view of the second shortcoming: that the handbook has not been updated since 1995. At that time, component manufacturers discontinued screening their parts for military compliance, but simultaneously and up to the present, have been improving quality of manufacturing processes, thus increasing components’ reliability. Based on experience, I modified the components’ quality factors and the result is shown by Item 2.
Table 1: Calculated reliability of Arduino Uno
There are a number of commercially available programs to aid in reliability prediction calculations. You only need to input the bill of material (BOM) with the components’ operating specifics, such as the temperature, derating, and so forth. The programs have built-in databases and do the calculations for you. Those programs are powerful, and in addition to the predicted failure rate, they can generate analyses such as Failure Mode and Effects Analysis (FMEA), Fault Tree Analysis (FTA), and others. Sadly, they are not inexpensive.
As there are different methodologies for calculating the predicted failure rate, these programs allow you to select which methodology you want to use. I performed 10 calculations for the same Arduino BOM and operating conditions by three computer programs per MIL-HDBK-217F, HDBK 217 Plus, Bellcore, Telcordia, Siemens, PRISM (which I believe is based on HDBK 217 Plus), and FIDES 2009 methodologies.
The results fall into two fairly consistent groups, roughly an order of magnitude (discounting Item 1) apart. Table 1 lists the calculated results and Figure 2 its graphic representation. Figure 2 displays the MTBF rather than the failure rate, which is easier to visualize.
Figure 2: Results tabulated in Table 1 shown graphically output.
As I already said, I discounted Item 1 which only demonstrates the MIL-HDBK-217F obsolete bias towards commercial parts. Items 3 and 4 show that the MIL-HDBK-217F implementation by commercial programs and my own adjustment, Item 2, are reasonably close. So are items 8, 10, and 13. These methodologies are geared towards tough military/aerospace applications and, therefore, I suspect, their statistical treatment is more conservative than that of Items 5, 6, 7, 11, and 12, which show the predicted MTBF up to more than a decade greater.
We should remind ourselves what those MTBF numbers mean. One hundred thousand hours MTBF represent 11.5 years of continuous operation! That’s a long time. It would be 23 years if operated only 12 hours daily. Many products become obsolete or fall apart before then. Consequently, I am rather skeptical about the predicted 578 years MTBF of Item #11.
Bellcore methodology was developed for telephone equipment, FIDES 2009 is the result of the efforts of the European aerospace manufacturers and the results are close to MIL-HDBK-217F. HDBK 217 Plus, Telcordia, Siemens and PRISM provided results by an order of magnitude greater.
It is important to strive for a realistic predicted failure rate because other analyses, not to mention design and manufacturing costs, warranty and spare parts allocation are affected by it. Later refinements of the calculated prediction by stress analyses, reliability testing and especially field experience should bring us close to the real value.
CHOOSE A METHOD
Which methodology should you use? My customers have always required MIL-HDBK-217, so I never had the headache of having to make and then justify my own choice. Despite its age, MIL-HDBK-217 continues to be alive and well to this day in the aerospace and military industries. In comparison with the results of Items 5, 6, 7, 9, 11, 12 MIL-HDBK-217 seems rather conservative, but when it comes to safety critical designs I much rather err on the safe side. My experience with the Arduinos doesn’t provide sufficient data to draw a general conclusion, although it appears to be better than the MIL-HDBK-217 predicts.
Ultimately, the field results will be the only thing that counts. All the calculations will be meaningless if the product’s reliability in field is not as required by the customer’s specification. Then you will be called upon to fix the design or the manufacturing process or both.
Reliability calculations differ from methodology to methodology, but if you are consistent using the same methodology, those numbers, coupled with experience, will enable you to judge what the field results will be. All my MTBF calculations met the specifications, but the field results have always exceeded those calculations. And that’s what really counts in the end.
George Novacek is a professional engineer with a degree in Cybernetics and Closed-Loop Control. Now retired, he was most recently president of a multinational manufacturer for embedded control systems for aerospace applications. George wrote 26 feature articles for Circuit Cellar between 1999 and 2004. Contact him at email@example.com with “Circuit Cellar” in the subject line.