A serious question about server reliability emerged on the HP3000 newsgroup today. The best answer came from an HP engineer whose career features more than 15 years of IO design and maintenance on hardware systems, including the ultimate 3000, the N-Class. And along the way, Jim Hawkins introduced many of us to the bathtub-curve charting strategy.
It looks like a bathtub, this chart of how reliable hardware can be. High left-hand side, the part of a product lifecycle called infant mortality. Long-term youth to middle-age to early senior years, the flat, stable part of the bathtub. Finally the end of life, that sharp upswing on the right where moving parts wear out.
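For readers who like to see a shape in numbers, here is a minimal sketch of a bathtub-style failure rate in Python. It adds up three Weibull hazard terms, a common textbook way to model the curve: one falling term for infant mortality, one constant term for the stable years, and one rising term for wear-out. Every name and parameter below is an assumption chosen for illustration; none of it comes from HP or from measured N-Class data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years):
    """Illustrative bathtub curve: the sum of three Weibull hazard terms.

    All shape and scale values are made-up numbers, not N-Class measurements.
    """
    infant = weibull_hazard(t_years, shape=0.5, scale=50.0)    # shape < 1: rate falls with age
    stable = weibull_hazard(t_years, shape=1.0, scale=20.0)    # shape = 1: constant rate
    wear_out = weibull_hazard(t_years, shape=5.0, scale=15.0)  # shape > 1: rate climbs with age
    return infant + stable + wear_out


if __name__ == "__main__":
    for year in (0.1, 1, 5, 10, 15, 20):
        print(f"year {year:>4}: failures per unit-year = {bathtub_hazard(year):.3f}")
```

With these made-up numbers the rate starts high, bottoms out somewhere in the middle years, and climbs again toward the end. That is the whole point of the bathtub: failure rates are not expected to rise in a straight line with age.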
The question was posed to the newsgroup readers by Steven Ruffalo:
I'm concerned about the reliability going forward of our N-Class servers. Are there any type of studies and metrics that could be used to determine how the failure rates of the parts on/in the N-Class will increase linear with the age of the equipment? I would imagine this would be true for any systems, but we have had an increase in processor failures over the last year. Is this coincidental, or should we start trying to stockpile additional spares?
According to Hawkins, there's been no tracking of N-Class hardware reliability by HP, which introduced the first N-Class models within a year of announcing it would be exiting the 3000 business. But he offered anecdotal, your-mileage-may-vary, caveat-emptor advice. He advised the 3000 owner that "You are in uncharted territory. Literally."
Typically reliability folks talk about the "bathtub curve" of failure rates: a high failure rate ("infant mortality"), long low "stable failure rate," and an acceleration "wear-out" phase. I don't know anywhere where there is enough decent data to track long term reliability for N-Class populations at a statistical level with reasonable confidence bounds (even inside HP).
I will say anecdotally the N-Class itself was not subject to any large quality issues that I can recall. That is, I have some recollection of issues both in K/T and following rx/rp ZX1 and ZX2 systems but, while my attention may have wandered, things seem to have been pretty solid for N-Class.
(That's a reference to the K-Class and T-Class servers, known as the Series 9x9 and 9xx systems in 3000-speak.)
"I don't have any data to project when or if you'll see a rapid rise in parts replacement needs," Hawkins added, "the far side of the reliability bathtub curve."
Moving parts are the first to wear out in any computing device, but Hawkins noted that "movement includes thermal cycles through on/off switching, or even temperature swings if you don't have well-managed HVAC." There's a reasonable lifespan for everything, and those N-Class systems are at least 12 years old by now. A user might consider how long to trust a 12-year-old disc drive, and give some thought to the reliability goals for solid-state components. Burnouts were pretty rare in the stories we've heard about HP servers that run MPE/iX. For the time being, a lot of N-Class owners are enjoying HP engineering that's had a smooth bottom.