UPS Redux: Finding Gurus and a False Dawn
July 8, 2013
Editor’s note: Previously, when a pair of HP 3000s was felled in the aftermath of a windstorm that clipped out the power, a sound strategy of using an Uninterruptible Power Supply in the IT mix failed, too. After a couple of glasses of merlot, our intrepid IT manager Alan Yeo at ScreenJet continues to reach out for answers to his HP 3000 datacenter dilemma: why the UPS that was supposed to be protecting his 3000s and Windows servers went down with the winds' shift.
By Alan Yeo
Second in a series
Feeling mellower, with nothing I really want to watch on the TV, I decide to take a prod at the servers and see what the problems are.
Decide that I'll need input to diagnose the Windows problem, so that can wait until the morning. Power-cycle the 917 to watch the self-test cycle and get the error, do it again. (Well sometimes these things fix themselves, don't they?) Nope, it’s dead!
“Take out my long spoon and sup with the devil,” as they say, with a Web search. Nope, Google turns up nothing on the error, apart from a couple of old HP-UX workstation threads, where the advice seems to be “time to call your HP support engineer.” Nothing on the 3000-L newsgroup archives, either. (I'd tell you the 3000 error code, but I've thrown away the piece of paper I had with all the scribbles from that weekend).
Where's a guru when you want one?
I really wanted to get the 917 back up and running over the weekend, as it had all our Transact test software on it. Dave Dummer (the original author of Transact) was doing some enhancements to TransAction (our any-platform replacement for Transact) and we had planned to get some testing done for early the following week, to help a major customer.
So it's 11:30 PM UK time, but it's only 3:30 PM PDT! I wonder who's around at Allegro? A quick Skype gets hold of Steve Cooper, who with the other Allegroids (interesting, my spell checker thinks Allegroid is a valid word) diagnoses within five minutes that the 3000 has got a memory error. The last digit of the error indicates which memory bank slot has the problem.
Feeling refreshed, let's get these hardware problems sorted. Get the Windows server booted with Hiren's Boot CD, a magic set of tools for fixing loads of stuff. It diagnoses that there are a couple of missing DLLs. Okay, patch them in. Still problems! It seems to be a hall of mirrors: every time we patch something in, the next missing file is found. This could go on for ages.
Try various Windows recovery reinstalls, but they all fail. Windows 2003 doesn't think it's installed, but would happily install if I let it reformat the hard drive. Not the recovery I was looking for. Run some disc-checking utilities and basically, whilst the disc checks out okay, the file directory (or whatever it's called) is smashed. Do we spend a lot of time rebuilding a Windows system that's only running one piece of software that should have been moved off anyway? Simple choice: no. Leave it to my co-worker Mark to figure out what to do to get mail flowing again, whilst I take a look at the 917 memory problem.
Pulling the memory card is no problem. Working out which of the five banks is bad takes a bit more work, but a bit of plug engineering and a couple of reboots show that we have 64MB (2x32) of bad memory. No problem, plenty left, so remove it and reboot. Great, get to the ISL prompt, do a START NORECOVERY and go get a cup of coffee and a cigarette, and I’ll soon have this system back up.
SYSTEM ABORT from SUBSYS 143
Long Story Short (or another one bites the dust)
Okay, it's about time we cut this story short — although I am certain you want to read about someone else's trials and tribulations, even as I suspect you’re only reading to find out why your UPS is useless. Suffice it to say that the 3000's LDEV 2 had also been fried, which we replaced, then the DAT drive was dead, which was replaced, but was still dead.
So in the end, we decided our fastest recovery solution was to scrap the 917 and merge its data with a 918 that has a clone in the shop. It’s a choice which makes disaster recovery a lot simpler. It's also one less piece of kit burning electricity, and that should help save the ice caps!
So what got Fried? HP 3000, Dell Intel Server, one modem, one DTC 16 -- and of course the two APC UPS's that were supposed to be protecting everything.
Why? Okay, okay, I've finally got around to the Meat and Potatoes bit. Given that the APC “Smart” UPS's had done such a wonderful job of protecting everything, there didn't seem much point sending them off anywhere for repair and putting them back into service. Also, I needed to get some replacements in ASAP. But the conundrum was why they hadn't protected everything as I had expected, so it’s about time to do some research on UPS's.
It turns out there is a little bit of a clue in the three-letter acronym. The “U” stands for “Uninterruptible,” not “Clean.” I discover that there are two main types of UPS. There's the normal Line-Interactive kind: everyone makes them, and everyone's got one, like the APC Smart UPS. Then there are the “On-line” ones. The major difference is that a standard “Smart” UPS (most of the time) feeds the mains supply out to everything plugged into it. In contrast, the on-line versions feed everything from an inverter 100 percent of the time.
But I hear you say (and as I thought), “My APC UPS filters the power, chopping down over voltage, boosting under voltage, and supplying power if the mains fails.” Well, the answer in classic 3000-L mode is, “Yes, but it depends.” Now I'm no electrical expert, but I’ve worked up a layman's interpretation.
There’s something in the mix called Dirty Transfers.
Line Interactive UPS's do AVR, Automatic Voltage Regulation. Instead of going to battery during low or high input voltages, this sort of unit will use an Autotransformer to increase or reduce the voltage to a safe operating range without running on the battery. Within their stated tolerances, they can run almost indefinitely doing a number of things.
- AVR Boost, where the UPS is compensating for a low utility voltage;
- AVR Trim, when it is compensating for a high utility voltage.
- If the voltage fluctuates outside a set range, or on some of them if the rate of change of the voltage exceeds a given threshold, then they will Transfer, using the battery power via an inverter. The UPS then monitors the AC supply and when it deems it is back within tolerance it transfers back to the mains supply.
It is this Transfer Time (TT) that can cause some problems, such as those at our shop.
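For anyone who prefers code to prose, the Line-Interactive behaviour described above can be sketched roughly as follows. This is only a layman's illustration, not how APC or anyone else actually implements it: every voltage threshold, the rate-of-change limit, and the names `Mode` and `choose_mode` are made up for the example, assuming a nominal 230V mains supply.

```python
from enum import Enum


class Mode(Enum):
    PASS_THROUGH = "mains fed straight through"
    AVR_BOOST = "AVR Boost (low utility voltage)"
    AVR_TRIM = "AVR Trim (high utility voltage)"
    ON_BATTERY = "transferred to battery via inverter"


# All figures below are hypothetical, chosen only to illustrate the idea
# for a nominal 230V supply. Real units publish their own tolerances.
BOOST_BELOW_V = 207.0      # below this, the autotransformer boosts
TRIM_ABOVE_V = 253.0       # above this, the autotransformer trims
TRANSFER_BELOW_V = 176.0   # below this, AVR gives up and transfers
TRANSFER_ABOVE_V = 282.0   # above this, AVR gives up and transfers
MAX_RATE_V_PER_S = 50.0    # some units also transfer on fast changes


def choose_mode(voltage, rate_of_change=0.0):
    """Pick an operating mode from the input voltage and how fast it is moving."""
    # Outside the range AVR can handle: transfer to the battery.
    if voltage < TRANSFER_BELOW_V or voltage > TRANSFER_ABOVE_V:
        return Mode.ON_BATTERY
    # Voltage changing too quickly: some units transfer on this as well.
    if abs(rate_of_change) > MAX_RATE_V_PER_S:
        return Mode.ON_BATTERY
    # Within AVR range: correct the voltage without touching the battery.
    if voltage < BOOST_BELOW_V:
        return Mode.AVR_BOOST
    if voltage > TRIM_ABOVE_V:
        return Mode.AVR_TRIM
    # Mains is fine, so it is passed straight through to the load.
    return Mode.PASS_THROUGH
```

Note what the sketch leaves out: the time the unit takes to switch between the mains and the inverter. In every mode except the last-resort one, the kit plugged in is still being fed from the mains.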
In the finale: Keeping it clean, and learning you're an HP customer once again.