MTBF, Redux
The discussion that arose from our offhand question about how mean time between failure (MTBF) numbers are arrived at continues to spawn interesting comments. Along with several new topics (spin-up/spin-down cycles and parts-count reduction), Scott Pearce from Maxtor Customer Service passes on some useful information directly from the people who deal with dead drives.
Atlant <[email protected]> writes:
One or two of the writers who have previously commented on the MTBF discussion mentioned that they didn’t think that disk drive manufacturers took spin-up/spin-down cycles into consideration when calculating MTBF numbers. They do! Last week, I was at a public presentation given by Quantum and they stated that their MTBF ratings for 3.5" (desktop class) disk drives were based on one spin-up/spin-down cycle per day. That statement is a little ambiguous – I don’t know if they meant a spin-down/up for every 8, 12, or 24 operating hours, but they clearly meant something much more conservative than "spin it up once and run it ’til it fails."
The specific context of the conversation concerned the new Energy-Star requirements and how the much shorter spin-up/spin-down cycles may affect the MTBF of 3.5" disk drives. Quantum seemed to be headed for a minimum disk spin-down timeout of two hours, lest the effect on MTBF be too great.
Jonathan Lundell <[email protected]> writes:
Another two bits from a reliability non-expert:
My company has been obliged to calculate MTBFs for a couple of large customers who required it, typically for government contracts. They provided a method for us to use, and I suspect that it’s widely used because it is simple.
The U.S. military, which is big on MTBF, has an assortment of references for different kinds of devices. In our case, these were PC boards and electronic components, but the same is probably true of mechanical devices.
Individual devices are given MTBFs (by someone – a high-ranking unnamed officer?) that tend to be very high. You calculate your product MTBF based on the reference MTBFs of its components and packaging methods.
This obviously makes no allowances for the varying quality of components from supplier A versus supplier B, but presumably you can use supplier A’s official numbers if you like.
Anyway, one reason for dramatically better claimed MTBFs is the equally dramatic reduction in parts counts. I oversimplify slightly, but it’s easy to see that if you cut the number of components in half, maintaining the same per-component MTBF, your overall MTBF roughly doubles.
Compare a five-year-old disk drive design with a new one, and you’ll see that the component count is cut by a very large factor. Note that this also reduces the number of electrical connections (solder joints, connectors), which are a significant source of failure.
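The parts-count argument can be sketched in a few lines of Python. This is only an illustration, assuming a simple series model in which the product fails as soon as any one component fails and every component has a constant, independent failure rate of 1/MTBF. The function name, the component counts, and the per-component MTBF figure are hypothetical and are not taken from any military reference.

```python
def series_mtbf(component_mtbfs_hours):
    """Estimate product MTBF when any single component failure kills the product.

    Assumes constant, independent failure rates, so the rates (1 / MTBF) add.
    """
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_failure_rate


# Hypothetical designs: same per-component MTBF, half the parts in the new one.
PER_COMPONENT_MTBF_HOURS = 1_000_000.0
old_design = [PER_COMPONENT_MTBF_HOURS] * 200
new_design = [PER_COMPONENT_MTBF_HOURS] * 100

old_mtbf = series_mtbf(old_design)
new_mtbf = series_mtbf(new_design)
improvement = new_mtbf / old_mtbf

print(f"old design: about {old_mtbf:.0f} hours")
print(f"new design: about {new_mtbf:.0f} hours")
print(f"improvement factor: {improvement:.2f}")
```

Under these assumptions, halving the number of identical parts halves the summed failure rate, which is why the overall figure roughly doubles. Real reference methods also weight each part by factors such as temperature and packaging, which this sketch leaves out.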
There are no doubt other factors too. Smaller disk drives have lower mechanical stresses. The trend to lower power also means lower temperatures, which is a factor in MTBF calculations. And finally, one hopes that drive engineers learn from their failures and improve their products that way.
Scott Pearce <[email protected]> of Maxtor Customer Service writes:
Maxtor finds many problems, in fact over 90 percent of failures, to be handling-related. It seems that by the time drives get down to dealers and little shops, they have been tossed about, no electrostatic discharge procedures have been followed, and all in all the drives have been treated badly.
Considering that drives leave the factory having passed extremely stringent certification tests, you would expect them to have an extremely low failure rate in the field. But we see a great many field failures clustered around specific volume assemblers and the like. Upon investigation, we find bare drives sitting on concrete floors, absolutely no electrostatic discharge protection, and so on. Once the companies assembling the drives have been educated and these issues fixed, the failure rate drops below one percent, as expected.
I think it is important that people realize that drives are still as sensitive to shock and shipping damage as they were several years ago. Although you do not need to park a hard disk, you must ship it in a proper shipping container and not in things like bubble wrap and sponge rubber.
The second issue is the return of damaged and failed drives for repair. As an example, a disk drive with a failed capacitor costing five cents may end up costing $200 to repair when it gets to the factory, if it was returned in poor packaging, causing the drive to suffer platter damage on return. In the end the customer pays because companies like Maxtor have to cost replacement drives at a higher rate to cover this.
The tips to remember are:
1. Never handle the drive by touching any part of the PC board assembly, even when using an anti-static strap. Pressure on the PC board assembly could crack components. Always handle the drive by the sides.
2. Never stand a drive on its side; it can be knocked down and sustain head shift or platter damage.
3. Never move a drive until it has spun down completely. Just because you cannot hear it spinning does not mean that it has completely spun down.
4. Always transport the drive in an anti-static bag, even across your office or workshop.
5. Always transport the drive in proper packaging as supplied by the hard disk manufacturer.
6. Before running a drive upside down or on its side, check with the manufacturer to see if the drive can perform in this orientation. Also ask if this lowers the MTBF.
7. Always check that your power supply is well suited to the number and type of drives that are present. Some large capacity drives require as much as 15 watts to spin up. In a PC with an ordinary power supply, that load could cause undue wear on the drive's PC board assembly components and spin motor.
8. Never touch the pins on the cable interface connector.
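The power supply check in tip 7 can be roughed out as a simple budget. This is a minimal sketch, assuming the worst case in which every drive spins up at the same moment and the supply has a single overall wattage rating. The function name, the supply rating, and the other-load figure are hypothetical; only the 15 watt spin-up figure comes from the tip above. Real supplies also have per-voltage limits that this sketch ignores, so treat the manufacturer's specifications as the final word.

```python
def spin_up_headroom_watts(supply_watts, other_load_watts, drive_spin_up_watts):
    """Return the watts left over if all drives spin up simultaneously.

    A negative result means the supply is over budget during spin-up.
    """
    return supply_watts - other_load_watts - sum(drive_spin_up_watts)


headroom = spin_up_headroom_watts(
    supply_watts=65.0,                 # hypothetical supply rating
    other_load_watts=40.0,             # hypothetical board and card load
    drive_spin_up_watts=[15.0, 15.0],  # two drives at the 15 W spin-up figure
)
supply_is_adequate = headroom >= 0

print(f"headroom during spin-up: {headroom:.1f} W")
print("supply looks adequate" if supply_is_adequate else "supply looks undersized")
```

With these made-up numbers the budget comes up short, which is the situation the tip warns about.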
I hope that some of this information is useful. It seems that reliability is always being judged by failure, yet few people pay attention to the way they handle the drives.