Adam Engst 11 April 1994

MTBF Revealed

We commented last week that we didn’t know how that increasingly large mean time between failures (MTBF) number is calculated for hard drives. Luckily, the beauty of having so many knowledgeable readers (and a few comedians) is that we receive good answers to such questions.

Luca Accomazzi <[email protected]> suggests helpfully:

I may have an answer about how to determine an MTBF of 500,000 hours. Buy 500,000 drives and turn them on. Watch and see how many die in one hour. If only one does, then you have an MTBF of 500,000 hours. Check to make sure that you have an ample supply of plugs before you try that at home, though.

E. Warwick Daw <[email protected]> writes:

I don’t know for sure how drive manufacturers do it, but the methods usually used for things like this are called "survival analysis." If you really want to know all the details, I’d suggest looking in the statistics section of your library for the theory, and then the industrial engineering and medical research sections for the applications. The basic idea is that you take 1,000 hard disk drives, turn them on, and let them run. Now, since MTBF is only an average, even if you have a very long MTBF, if you have enough drives, you will expect to see a few failures relatively quickly. So, say 5 of your 1,000 drives fail in six months of testing. You take your data and, making certain mathematical assumptions, you fit a "survival curve" (a plot of the number still working vs. time) to it. From this curve, you can calculate a predicted MTBF based on your mathematical assumptions.
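To make the arithmetic concrete, here is a minimal sketch of the simplest possible fit: it assumes a constant failure rate (an exponential survival curve), which is our assumption and not necessarily what any manufacturer uses. Under that assumption the MTBF estimate is just total drive-hours divided by the number of failures. The function name and the individual failure times are hypothetical; only the "5 of 1,000 in six months" figure comes from the letter.

```python
HOURS_PER_YEAR = 8760


def estimate_mtbf(drives, test_hours, failure_times):
    """Estimate MTBF as total drive-hours on test divided by failures.

    Assumes a constant failure rate; failure_times holds the hour at which
    each failed drive died, and every other drive ran for test_hours.
    """
    if not failure_times:
        raise ValueError("no failures observed; this estimate is undefined")
    survivors = drives - len(failure_times)
    total_hours = survivors * test_hours + sum(failure_times)
    return total_hours / len(failure_times)


# 1,000 drives, six months of testing, 5 failures at made-up times.
six_months = HOURS_PER_YEAR // 2
hypothetical_failures = [300, 1100, 2000, 3100, 4000]
mtbf = estimate_mtbf(1000, six_months, hypothetical_failures)
print(f"Estimated MTBF: {mtbf:,.0f} hours ({mtbf / HOURS_PER_YEAR:.1f} years)")

# The tongue-in-cheek test: 500,000 drives, one hour, one failure.
joke_mtbf = estimate_mtbf(500_000, 1, [1])
print(f"Joke MTBF: {joke_mtbf:,.0f} hours")
```

With these invented failure times the estimate comes out near 874,000 hours, or roughly a century, from half a year of testing. That gap is exactly why the choice of mathematical assumptions matters so much.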

As a pure mathematician turned applied mathematician, I am quite skeptical of claims of a half-century MTBF based on a few months of testing. IMHO, the usual mathematical assumptions about how complex mechanical devices wear out just don’t apply over decades, but I can’t really judge the methods used by the drive manufacturers without examining them closely.

In any case, although I take MTBF ratings with a grain of salt, I do still consider them useful, and, all other things being equal, I would get the drive with the higher MTBF. Just realize that what the MTBF represents is the chance that the drive will crash in the first year you have it, and not the total lifetime of the drive.
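The point about the first year can be sketched numerically. Assuming, again, a constant failure rate (our simplification, valid at best for the flat middle of a drive's life), the chance of a failure within a given time is 1 - exp(-time / MTBF). The function name below is ours.

```python
import math

HOURS_PER_YEAR = 8760


def failure_probability(mtbf_hours, hours):
    """Probability of at least one failure within `hours`,
    assuming a constant failure rate of 1 / mtbf_hours."""
    return 1.0 - math.exp(-hours / mtbf_hours)


first_year = failure_probability(500_000, HOURS_PER_YEAR)
at_mtbf = failure_probability(500_000, 500_000)

print(f"Chance of failure in the first year: {first_year:.2%}")
print(f"Chance of failure by the MTBF itself: {at_mtbf:.1%}")
```

For a 500,000-hour MTBF the first-year figure is a bit under 2 percent. The second number shows why MTBF is not a lifetime: even under this idealized model, well over half the drives would have failed by the time the MTBF is reached.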

Caesar Chavez <[email protected]> explains:

I have been out of the "reliability business" for a few years. But I may be able to provide you with some information regarding disk reliability.

Disk drive reliability from the mid-70s to the mid-80s improved by a factor of five, from 10,000 hours to 50,000 hours. I didn’t realize that their reliability has improved so dramatically in the last few years. You are right in that engineers have discovered a way of testing hardware in an accelerated way.

In order to derive MTBF numbers, some assumptions are normally made. First of all, a level of ambient temperature is assumed, usually room temperature with cooling and/or fans providing air flow. Second, the devices are not turned off and on often. Third, mechanical parts are guaranteed to be lubricated properly. Fourth, oftentimes the devices are "burned in," which means they were run while cycling power and temperature for a time in order to "shake out" weak devices; these early failures are what the term "infant mortality" refers to. Fifth, a semiconductor part is assumed to receive all signals and power within narrow tolerances. Under these ideal conditions, a manufacturer can provide an MTBF number for a device.

These specifications provide the key for accelerated testing. Military and aerospace standards, which by necessity require extremely high reliability numbers, typically state that for each 10 degrees of temperature rise, parts will fail at some extrapolated rate. If memory serves me correctly, if mechanical or semiconductor devices specified to be operated at 20 degrees centigrade are operated at 50 degrees centigrade instead, they will fail three times or eight times as often, respectively. An electromechanical device such as a disk drive, under elevated temperature, will fail at a much higher rate weighted by the proportion of electrical versus mechanical parts contained in it. Therefore, reliability numbers may be derived by running a device at an elevated temperature for a much shorter period of time than would normally be required in order to generate failure rates under normal operating conditions. In addition, power cycling may be used to accelerate failures; sometimes signals and power input or output may be operated outside of normal manufacturer-specified operating conditions. Application of these failure-inducing processes to MTBF rates is called "derating" a part under stress.
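A rough sketch of how those recalled figures could be used follows. It takes the three-times and eight-times multipliers for a 30-degree rise at face value, assumes the failure rate multiplies by a fixed amount for every 10 degrees, and blends the two by the share of baseline failures that are mechanical. The 50/50 share and the function names are hypothetical, chosen only for illustration.

```python
def scale_factor(ref_factor, ref_rise_c, rise_c):
    """Scale a quoted failure-rate multiplier to a different temperature
    rise, assuming a fixed multiplier per 10 degrees C."""
    return ref_factor ** (rise_c / ref_rise_c)


def blended_factor(mech_share, mech_factor, elec_factor):
    """Overall multiplier for a device whose baseline failures split into
    a mechanical share and an electronic share (shares sum to 1)."""
    return mech_share * mech_factor + (1.0 - mech_share) * elec_factor


MECH_AT_30C_RISE = 3.0  # figure recalled in the letter (20 C -> 50 C)
ELEC_AT_30C_RISE = 8.0  # figure recalled in the letter (20 C -> 50 C)

mech_per_10c = scale_factor(MECH_AT_30C_RISE, 30, 10)  # about 1.44
elec_per_10c = scale_factor(ELEC_AT_30C_RISE, 30, 10)  # about 2.0

MECH_SHARE = 0.5  # hypothetical: half of baseline failures are mechanical
drive_factor = blended_factor(MECH_SHARE, MECH_AT_30C_RISE, ELEC_AT_30C_RISE)

test_hours = 1000
equivalent_hours = test_hours * drive_factor
print(f"Per 10 C: mechanical x{mech_per_10c:.2f}, electronic x{elec_per_10c:.2f}")
print(f"{test_hours} hours at 50 C ~ {equivalent_hours:,.0f} hours at 20 C")
```

Under these assumptions, 1,000 hours in the hot chamber stands in for 5,500 hours at room temperature. A real test program would use measured multipliers for each failure mechanism.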

NASA and the Department of Defense have spent billions of dollars and years to verify their conclusions. As you stated, for the normal, non-military user, if a device is run under normal operating conditions in terms of temperature, power, and power cycling, quality commercial-grade disk drives should last for a long time.

John Woods <[email protected]> confirms:

In most cases, the manufacturers run their MTBF tests at elevated temperatures and voltages, having determined through empirical tests how much faster key parts fail when the specs are exceeded by a given amount. They also do some analysis from the MTBF of individual components (sometimes learned from the previous method) and calculate the system MTBF accordingly. Some manufacturers may be just guessing, though…

I pay much more attention to the warranty period than to the MTBF, since the warranty period isn’t a guess or a statistical prediction, it’s a promise. A 57-year MTBF coupled with a 1-year warranty sounds as though the company in question isn’t all that sure of its MTBF figure.

Rich Straka <[email protected]> provides more details:

First, a little explanation on failures. There is a general concept of failures that breaks them up into three categories:

Infant mortality – Manufacturing defects, DOAs, and so on. These are things like wire nicks, poor soldering, etc. Basically, we’re talking about manufacturing anomalies that should fail within the warranty period.
Wearout – Simple, known processes which degrade something. Common examples include muffler rust-through, auto body rust, etc.
Everything else – (I forget if this has a more proper term.) These are random failures of parts which are already past their infant mortality ("burned-in"), but not yet at that wearout stage. This is the kind of failure that MTBFs are based upon.

The "Bathtub Curve" is a plot of the general failure rate of some component or system:
    Failure rate

        Infant           Everything              Wearout
        Mortality          Else

  High  |\                                         /
        | \                                       /
        |  \                                     /
        |   \___________________________________/
  Zero  |___________________________________________________
        0
                            Time
MTBFs — System MTBFs are tricky things to begin with. I would assume that there are all sorts of ways of coming up with them. Their reliability as a measure of quality is highly dependent on the ethics of those who determine them and quote them.

One way is to measure the failure rate by firing up a lot of units and waiting a long time for failures to occur. Infant mortality is not counted (for obvious reasons). Wearout failures are not usually counted either. For example, muffler MTBF is relatively low (if, indeed, anybody even considers such a figure), but muffler wearout is relatively common and predictable. These are not the same things!

Another way is to come up with a composite MTBF, composed of the individual MTBFs of all of the components of the system. I’m not up on the math typically used for this assessment. Each of the components, of course, must have a properly assessed MTBF.
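One common textbook approach to the composite figure, offered here as a sketch and not as what any particular manufacturer does, treats the system as a "series" of parts: any single component failure kills the unit, and each component fails at a constant rate of 1/MTBF. The failure rates then add, so the system MTBF is the reciprocal of the summed rates. The component names and figures below are made up.

```python
def series_mtbf(component_mtbfs):
    """System MTBF when any one component failure fails the whole unit,
    assuming each component has a constant failure rate of 1 / MTBF."""
    total_rate = sum(1.0 / mtbf for mtbf in component_mtbfs)
    return 1.0 / total_rate


# Hypothetical component MTBFs, in hours.
components = {
    "spindle motor": 1_000_000,
    "head assembly": 1_500_000,
    "controller board": 3_000_000,
}

system_mtbf = series_mtbf(components.values())
print(f"System MTBF: {system_mtbf:,.0f} hours")
```

With these invented numbers the system lands at about 500,000 hours. The composite is always lower than the weakest component's MTBF, and one badly assessed component throws off the whole figure.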

For any MTBF, operating environments (temperature, voltage, etc.) must be specified. For hard disks, it’s not clear if they ever power cycle them, for instance. I suspect not, and that’s the subject of another conversation.

Accelerated Testing — Instead of waiting around for failures, it is possible to characterize a type of failure (electromigration, sodium contamination, etc.) of individual components based on operating temperature.

A Swedish chemist and physicist by the name of Arrhenius developed an equation stating that many chemical and physical processes are governed by temperature, where the speed of reaction of a process is proportional to the natural antilog (e to the power) of the negative of some constant divided by the absolute temperature.

In order to determine the acceleration of the reaction rate of a process, you calculate the rates for the two temperatures of interest and divide them. The actual numbers are of little interest; the ratio is what is important here.

This constant is set by the device’s "activation energy" (it is the activation energy divided by Boltzmann’s constant), which is specified in units of electron-volts. Common values are 0.7 to 0.9 eV, which is a big range (being up in the exponent).

Most folks in the quality business do tests (testing failure rates at different temperatures) to determine a device’s activation energy.

With this information in hand, they can then test devices at high temperatures to simulate long service times. They calculate the acceleration factor for a particular temperature from the Arrhenius equation, enabling them to test many years’ worth of wear in just a few weeks. This is how we used to test the data retention parameters of EPROMs back in the late 70s.
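The ratio described above can be sketched in a few lines using the standard form of the Arrhenius relation, in which the rate is proportional to exp(-Ea / (k * T)), with Ea the activation energy, k Boltzmann's constant, and T the absolute temperature. The 25 C and 125 C temperatures are our own illustrative choices, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV per kelvin


def arrhenius_acceleration(ea_ev, use_c, stress_c):
    """Ratio of the reaction rate at the stress temperature to the rate at
    the use temperature, for an activation energy given in eV."""
    t_use = use_c + 273.15      # convert Celsius to kelvin
    t_stress = stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress))


af_low = arrhenius_acceleration(0.7, 25, 125)
af_high = arrhenius_acceleration(0.9, 25, 125)
print(f"Ea = 0.7 eV: about {af_low:,.0f}x faster at 125 C than at 25 C")
print(f"Ea = 0.9 eV: about {af_high:,.0f}x faster at 125 C than at 25 C")

# Oven hours needed to simulate ten years of service at 25 C.
ten_years_hours = 10 * 8760
print(f"Ten years at 25 C ~ {ten_years_hours / af_low:,.0f} oven hours (Ea = 0.7 eV)")
```

The two acceleration factors come out at roughly 940 and 6,600, which shows why a 0.2 eV difference is "a big range": the same oven test can represent very different amounts of simulated service depending on the activation energy assumed.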
