Checkup Tips to Diagnose Creeping Crud
March 26, 2015
When an HP 3000 from the final generation of the line developed trouble for Tom Hula, he turned to the 3000 newsgroup for advice. He'd gotten his system back up and serving its still-crucial application to users. But even after a restart, with the server looking better, things just didn't seem right to him.
I am concerned, since I don't know what the problem was. It almost reminded me of something I used to call the Creeping Crud, where people started freezing up all over the place, while some people were still able to work. The only thing was a reboot. But in this case, it seemed worse. Only a few people on our 3000 now, but we still depend on it for a high-profile application. What should I check?
The most revealing advice came from Craig Lalley, who told Hula he'd use a Control-B to get into the 3000's system log. The steps after the Control-B command are SL (for System Log) and E (for Errors only). Typing CO puts the 3000 back in console mode. Hula's system had lost its date and time on one error, and the Alert Levels showed a software failure along with lost boot functionality.
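For anyone who wants that key sequence taped next to the console, here is a minimal Python sketch that prints it as a numbered crib sheet. The four keystrokes come from Lalley's advice above; the one-line descriptions are paraphrases, and the names GSP_ERROR_LOG_STEPS and format_checklist are made up for this illustration. It does not talk to a 3000 in any way.

```python
# Console key sequence as described in the article.
# The descriptions are paraphrases, not HP documentation.
GSP_ERROR_LOG_STEPS = [
    ("Control-B", "interrupt the console session"),
    ("SL", "open the System Log"),
    ("E", "show Errors only"),
    ("CO", "return to console mode"),
]


def format_checklist(steps):
    """Return the steps as numbered lines for a printed crib sheet."""
    return [
        f"{n}. {key:<10}{note}"
        for n, (key, note) in enumerate(steps, start=1)
    ]


if __name__ == "__main__":
    print("\n".join(format_checklist(GSP_ERROR_LOG_STEPS)))
```

Keeping the steps in an ordered list means a site can append its own follow-up checks without renumbering anything by hand.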
But amid the specifics of eliminating the Creeping Crud (it may have been a dead battery) came sound advice on how to prepare for a total failure and where to look for answers to 3000 hardware problems. The good news on the battery is that it's not in a Series 9x7. Advice from five years ago on battery replacement pointed to a hobbyist-grade workbench repair. More modern systems like Hula's A400 at least have newer batteries.
I had a system acting strangely this past weekend. It was basically hung but allowed new logons. I could not abort anyone. When I got to the point where I tried to stop the network, I got a system abort 1458 from Subsystem 102. I didn't bother to take a dump. I completed the boot and everything was better.
Chuck Trites reminded Hula to create a current CSLT tape, and "run BULDJOB to create the BULDJOB1 and 2 files — in case you need to recreate the accounting structure and UDCs — and store them to tape."
Hula's own checklist included the following:
During the reset, the 3000 got up to the date and a little past and seemed frozen. I pulled the plug and restarted again. It took 2-3 times as long as normal and at first, the red fault light was on (I never saw that on before). After it got a bit into the restart, the fault light turned off by itself. The only attention message I got about the whole thing was a message with everything unknown on the 3000.
When the computer came all the way up, it still seemed very sluggish. I scheduled the nightly update and backup and went home to look at it more in the morning. I logged on from home and the backup seemed to be running okay.
This morning I tried resetting the GSP and checked the connections to the console terminal. I also found out that someone else had a hard time getting on the 3000 towards the end of the day. Very sluggish. But this morning, everything seems back to normal.
Hewlett-Packard's hardware builds have been extraordinary, but a server that's been churning out critical data for more than 12 years (production of A-Class boxes stopped in 2003) can develop crud. Something as simple as replacing a dead battery might be the answer to the woes. Advice for the crud also came from Gilles Schipper, Jack Connor and the others mentioned. What they've got in common is working in a support practice, or at least a consulting business that includes 3000 sites.
Self-maintenance is common in a community like the 3000's. It's also a good practice to have a support vendor, one who knows the system as well as the volunteers posting to the newsgroup.