Previously, when a pair of HP 3000s was felled in the aftermath of a windstorm that clipped out the power at Alan Yeo's shop, his Uninterruptible Power Supply in the mix failed as well. After a couple of glasses of merlot, our intrepid developer and founder of ScreenJet continued to reach for answers to his HP 3000 datacenter dilemma. Why did the UPS that was supposed to be protecting his 3000s and Windows servers FAIL once the power died?
By Alan Yeo
Second in a series
Feeling mellower and with nothing I really wanted to watch on the TV, I decided to take a prod at the servers and see what the problems were. I decided I'd need input to diagnose the Windows Server problem, so that could wait until the morning. I power-cycled the 917 to watch the self-test cycle and got the error, then did it again. (Well, sometimes these things fix themselves, don't they?) Nope, it was dead!
Google turned up nothing on the error. Nothing on the 3000-L newsgroup archives, either. I'd tell you the 3000 error code, but I've thrown away the piece of paper I had with all the scribbles from that weekend.
Where's a guru when you want one?
I really wanted to get my 917 back up and running over the weekend, as it had all our Transact test software on it. Dave Dummer (the original author of Transact) was doing some enhancements to TransAction (our any-platform replacement for Transact) and we had planned to get some testing done for early the following week, to help a major customer.
So it's 11:30 PM UK time, but it's only 3:30 PM PDT. I wonder who's still around at Allegro? A quick Skype gets hold of Steve Cooper, who with the other Allegroids diagnoses within five minutes that the 3000 has a memory error. The last digit of the error indicates which memory bank slot has the problem.
Okay, I'm not going to start climbing around the back of the rack at this time of night. I leave it until the morning, but at least I know what the problem is.
Pulling the 3000's memory card is no problem. Working out which of the five banks is bad takes a bit more work, but a bit of plug engineering and a couple of reboots show that we have 64MB (2x32) of bad memory. No problem, plenty left, so remove it and reboot. Great, get to the ISL prompt, do a START NORECOVERY and go get a cup of coffee and a cigarette, and I’ll soon have this system back up.
SYSTEM ABORT from SUBSYS 143
Long Story Short (or another one bites the dust)
Okay, it's about time we cut this story short, although I am certain you want to read about someone else's trials and tribulations, even as I suspect you’re only reading to find out why your UPS is useless. Suffice it to say that the 3000's LDEV 2 had also been fried, so we replaced it. Then the DAT drive was dead, so we replaced that too, but it was still dead.
So in the end, we decided our fastest recovery solution was to scrap the 917 and merge its data with a 918 that had a clone in the shop. It’s a choice that makes disaster recovery a lot simpler, and it means one less piece of kit burning electricity, which should help save the ice caps!
So what got Fried? HP 3000, Dell Intel Server, one modem, one DTC 16 -- and of course the two APC UPS's that were supposed to be protecting everything.
Why? Given that the APC “Smart” UPS's were supposed to do such a wonderful job of protecting everything, the conundrum was why they hadn't. It was time to do some research on UPS's.
It turns out there is a little bit of a clue in the three-letter acronym. The “U” stands for “Uninterruptible,” not “Clean.” I discover that there are two main types of UPS. The first is the normal Line-Interactive type: everyone makes them, and everyone's got one, like the APC Smart UPS. Then there are the “On-line” ones. The major difference is that standard “Smart” UPS's (most of the time) feed a mains supply out to everything plugged into them. In contrast, the on-line versions feed everything from an inverter 100 percent of the time.
But I hear you say (and as I thought) “My APC UPS filters the power, chopping down over voltage, boosting under voltage, and supplying power if the mains fails.” Well, the answer in classic 3000-L mode is, “Yes, but it depends.” Now I'm no electrical expert, but I’ve worked up a layman's interpretation.
There’s something in the mix called Dirty Transfers.
Line-Interactive UPS's do AVR, Automatic Voltage Regulation. Instead of going to battery during low or high input voltages, this sort of unit will use an autotransformer to increase or reduce the voltage to a safe operating range without running on the battery. Within their stated tolerances, they can run almost indefinitely doing a number of things:
- AVR Boost, where the UPS is compensating for a low utility voltage;
- AVR Trim, where it is compensating for a high utility voltage;
- If the voltage fluctuates outside a set range, or on some of them if the rate of change of the voltage exceeds a given threshold, then they will Transfer, using the battery power via an inverter. The UPS then monitors the AC supply and when it deems it is back within tolerance it transfers back to the mains supply.
It is this Transfer Time (TT) that can cause problems, such as those at our shop.
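To make the line-interactive behaviour above concrete, here is a minimal Python sketch of how such a unit might pick its operating mode. The voltage thresholds, the rate-of-change limit, and the names `Mode` and `select_mode` are all hypothetical, chosen for a nominal 230 V supply purely for illustration; they are not taken from any APC specification.

```python
from enum import Enum


class Mode(Enum):
    PASS_THROUGH = "mains passed straight through"
    AVR_BOOST = "AVR Boost (autotransformer raising a low mains voltage)"
    AVR_TRIM = "AVR Trim (autotransformer lowering a high mains voltage)"
    ON_BATTERY = "Transfer (load fed from the battery via the inverter)"


# Hypothetical thresholds for a nominal 230 V supply.
# Real units publish their own figures; these are illustrative only.
BOOST_BELOW_V = 207.0      # below this, boost the voltage
TRIM_ABOVE_V = 253.0       # above this, trim the voltage
TRANSFER_BELOW_V = 176.0   # below this, AVR cannot cope: go to battery
TRANSFER_ABOVE_V = 282.0   # above this, AVR cannot cope: go to battery
MAX_RATE_V_PER_MS = 50.0   # some units also transfer on a fast voltage change


def select_mode(volts, rate_v_per_ms=0.0):
    """Return the operating mode for a given mains voltage and rate of change."""
    # A voltage that changes too quickly triggers a transfer on some units.
    if abs(rate_v_per_ms) > MAX_RATE_V_PER_MS:
        return Mode.ON_BATTERY
    # Outside the range AVR can correct, the unit transfers to battery.
    if volts < TRANSFER_BELOW_V or volts > TRANSFER_ABOVE_V:
        return Mode.ON_BATTERY
    # Inside that range, the autotransformer corrects the voltage.
    if volts < BOOST_BELOW_V:
        return Mode.AVR_BOOST
    if volts > TRIM_ABOVE_V:
        return Mode.AVR_TRIM
    return Mode.PASS_THROUGH


if __name__ == "__main__":
    for v in (230, 200, 260, 150, 0):
        print(f"{v:>3} V -> {select_mode(v).value}")
```

Note that in three of the four modes the load is still connected to the mains, either directly or through the autotransformer. Only the last mode involves a Transfer, and with it a transfer time.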