Editor's note: Intrepid veteran developer Alan Yeo of ScreenJet in the UK had a pair of HP 3000s felled recently, despite his sound strategy of using an Uninterruptible Power Supply in his IT mix (or "kit," as it's called in England). In honor of our fireworks-laden weekend here in the US, we offer Yeo's first installment of the rescue of the systems which logic said were UPS-protected. As Yeo said in offering the article, "We're pretty experienced here, and even we learned things through this about UPS." We hope you will as well.
New UPS Sir!
"Would you like fries with that?"
By Alan Yeo
First of a series
"Smart UPS" now has a new meaning to me: "You're going to smart, if you're dumb enough to buy one." I guess this is one of those stories where if you don't laugh you'd cry, so on with the laughs.
By the end of this tale, you should know why your UPS may be a pile of junk that should be thrown in the trash. And what you should replace it with.
A Friday in early June and it was incredibly windy. Apparently we were getting the fag end of a large storm that had traversed the Atlantic after hitting the US the week before. Sort of reverse of the saying "America sneezes, and Europe catches a cold." This time we were getting the last snorts of the storm.
Anyway, with our offices being rurally located, strong winds normally mean that we are going to get a few power problems. The odd power blip and the very occasional outage as trees gently tap the overhead power lines. Always worst in the summer, as the trees are heavily laden with leaf and drooping closer to the lines than they are in the winter, when they come round and check them.
So this situation is not normally something we worry about. We are fairly well-protected (or so we thought) with a number of APC UPS units to keep our servers and comms kit safe from the blips and surges. The UPS units are big enough so that if the power does go out, we can keep running long enough for either the power to come back -- or if we find out from the power company that it's likely to be a while, for us to shut down the servers.
We keep all the comms kit, routers, switches, firewalls and so forth on a separate UPS. This UPS will keep them running nearly all day, so that way we still have Internet access, Web, email and more, and can keep functioning as long as the laptop batteries hold out.
It's not dead, it's just sleeping after a long squawk!
Humm… First I thought it must be the overload switch, so I disconnected all the load, grovelled around behind it and pressed the reset switch. Nothing. So I disconnect from the mains, reset, power it back on, nothing. Check the fuse in the plug, all okay, it's still dead. Dig out the APC manual, whose symptoms say "don't use, return to your supplier for service."
At this point the power goes completely for 10 minutes, and I can see that the server UPS batteries are already half empty (or half-full if you're an optimist). "They must have been taking more of a load during the morning than I thought," I say to myself. I decided it was time for a controlled shutdown of the servers, which I did. Now I was going to have to rejig the power cables, so that we could feed power to the comms kit (which was now on a dead UPS) from the servers' UPS. A couple of minutes of work commenced, to move their supplies to spare outlets on the APC Switched Rack PDU that is fed by the UPS. The PDU is a network-addressable Power Distribution Unit, one that can power up/down individual power outlets, and thus we can remotely shut down or reset the servers if needs be.
So at this point the power comes back, and I power up the comms kit, leaving the servers off. Decide I'll go for lunch, let the batteries recharge a bit, and make sure that the power is staying on before I restart the servers.
Lunch passes, with a glass of Merlot.
Now the power seems to be stable, so it's back to the computer room to bring up just the essential servers: our main HP 3000 test server, a Windows mail server, and a Windows file server that also handles our VPN connections (because everyone works remotely now).
I'm in the middle of this when the power goes out again. I look at the PDU, which tells me that we are drawing 3 amps (240V x 3A = 720 watts), which is about 10 minutes' worth on a half-charged 2200VA UPS. Not worth it, so I shut the servers down (but I don't throw their power switches).
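For anyone wanting to repeat that back-of-the-envelope sum, here is a minimal Python sketch. The 240V and 3A figures come from the PDU reading above; the battery energy (300 Wh) and inverter efficiency (0.8) are made-up illustrative values, not taken from any APC data sheet, and the function names are hypothetical. It also assumes runtime scales linearly with charge, which real batteries only roughly do.

```python
def load_watts(volts: float, amps: float) -> float:
    """Load on the PDU, treating volts x amps as watts (as the story does)."""
    return volts * amps


def estimate_runtime_minutes(battery_wh: float,
                             charge_fraction: float,
                             inverter_efficiency: float,
                             load_w: float) -> float:
    """Rough linear runtime estimate; every input here is an assumption."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    # Energy actually deliverable to the load at the current charge level.
    usable_wh = battery_wh * charge_fraction * inverter_efficiency
    # Hours of runtime converted to minutes.
    return usable_wh / load_w * 60.0


load = load_watts(240, 3)  # PDU reading from the story

# Hypothetical battery figures, chosen only for illustration.
minutes = estimate_runtime_minutes(
    battery_wh=300.0,         # assumed usable battery energy
    charge_fraction=0.5,      # "half-charged"
    inverter_efficiency=0.8,  # assumed conversion losses
    load_w=load,
)

print(f"Load: {load:.0f} W, estimated runtime: {minutes:.1f} minutes")
```

Swap in the figures from your own UPS data sheet; the point is only that a half-empty battery under a 720-watt load buys minutes, not hours.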
At this point the power comes back and stays on for about five minutes. There's me standing there trying to decide what to do, when the power goes off again, and then comes back. At which point the sole remaining UPS goes BANG! It flashes its lights a bit whilst beeping manically, and then goes dead. The room fills with the smell of over-heated insulation, so I pull the UPS power plug.
Okay, "Sod this for a bunch of Soldiers," thinks I. Was going to finish early that day to help some friends set up for a weekend Charity Clay Shoot. "I'll go now and come back later -- when hopefully the wind has died down and the power is back to normal -- and then pick up the pieces."
Back in the datacentre at 8 p.m. and the wind is gone, with power back to normal. Okay, should just have time to get everything working before dinner. Play with the UPS for 10 minutes, but it's dead. So we are going to have to "walk the tightrope without safety harness or net" and run everything direct from the mains.
Not exactly completely unprotected computing, because when we had had the new office wired 18 months ago, we installed surge protection on the mains supply. It's like a couple of cartridges that sit next to the distribution panel and absorb a surge, decaying in the process, until the point they need replacing. They have a status indicator on them telling you if they need changing, but they were showing green, so I thought I'd risk it for a few days, until we could source a new UPS.
Why do these things always hit at a weekend?
Comms come back okay, although I noticed that an old dial-up modem, still hooked up for dire emergency remote access if Internet access failed, was dead. Okay, now for the servers: power up the Series 917 and let it start its self-test check (which takes ages, and lots of memory); power up the Series 918 (it does its memory tests much quicker); power up the Windows 2008 file server and a Windows mail database server. Plus, an older Windows 2003 server that still ran the SMTP software, which should have been moved to the 2008 server, but hadn't because we had never got around to it.
The HP 3000 918 comes up clean, the Windows 2008 server comes up, the Windows mail database server comes up. But the HP 3000 917 is down with an FLT error, and the Windows 2003 Server is looping around boot start-up into Windows launch, then straight back to boot start-up. Wonderful! Sod it, go and have dinner and decide if I'm coming back later.