Friday Fine-Tune: Driving Filesystem Checks
December 9, 2016
In the middle of a full backup, the HP 3000 at James Byrne's shop came to a system halt at 3 AM. It was the kind of halt that puts up those puzzling abort messages not even HP has fully documented. For example, about SA 1458, Robelle's Neil Armstrong said, "My experience with SA 1458 is that it is a catch-all abort. You need to look at the subsystem information and the only way to truly know the root cause would be to get a dump and analyze it." He referenced a webpage that breaks down the process of doing 3000 system failure analysis, too.
When a halt occurs during a backup, there's always the chance the 3000's filesystem has been injured. "I'd say run FSCHECK.MPEXL.TELESUP and check your filesystem," said Keven Miller of 3k Ranger. He added that a former HP support expert, Lars Appel, "instructed me that System Abort messages are in subsystem 98. From the MPE Error Messages Volume 2, Chapter 4, System Aborts, 1458 MESSAGE means A critical process is being terminated due to a trap."
Sure enough, power interruptions at Byrne's shop introduced damage to an Image database.
"We reached the point last summer where we were toying with going off-grid simply to avoid the repeated power interruptions," Byrne wrote. "If this sort of thing is causing damage then we will have to consider it. And it seems to; we now have a broken backward chain in one of our Image databases. A thing that I cannot ever recall. Coincidence? We are doing a backup, and then I will be using Adager to go in and take a look."
FSCHECK is a tool included on the 3000. It's simply there to validate extents and scan the table cache for missing files. Better resources include not only the legendary Adager but also independent support suppliers for the 3000 owner: people who know the fast commands of tools like CSTM. "What does your support provider say?" asked one support vendor. Self-support can be backed up by questions on 3000-L. Some of the advice about the halt even ran to looking at memory issues. A provider can help eliminate these possibilities.
"I am simply trying to find out if there is any way of examining whether or not we actually have a failing drive," Byrne wrote. "We have spares but if there really is no need then I would rather not take the system down again after such a short interval. It has been a bad fall for our poor old 918. The system HDD was toasted by a whipsaw set of power outages on October 11; now our data disc is suspected of being ready to let go as well."
While running FSCHECK, Byrne was advised to use these commands:
Check all Dev=all
FSCOUNT / 10000
He also received these instructions from Mark Ranft on how to scan logs using CSTM to find disk errors:
Sign on as Manager.SYS. Do a LISTF LOG####,2 to find the start and ending Log files. Alter the log file number range in the commands and enter the commands in LOGTOOL.
list log=3404/3477 type=111 "device class"="hard disc",da,ca,"bus converter" out=LogOut1
list log=3404/3477 TYPE=111;'MGR CODE'= 241,242,900,901,951 out=LogOut2
The results are written to two files LOGOUT1 and LOGOUT2. I had this set up to run weekly on my systems. And if the files had errors in them, the job would email the results to me for review. You will see errors due to SCSI or FC resets on every boot, so check the timestamps to tell if the boot caused the error.
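The one manual step in Ranft's procedure is altering the log file number range in the two commands. As a rough illustration only, that substitution could be scripted off the 3000 with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the log file names are assumed to arrive as plain strings of the form LOG followed by four digits (parsing real LISTF output is not attempted), and the command text is copied from the examples above without being verified against LOGTOOL itself.

```python
import re

# Hypothetical helpers, not part of MPE/iX, CSTM or LOGTOOL.
# They only build the command strings quoted in the article.


def log_range_from_names(names: list[str]) -> tuple[int, int]:
    """Return the lowest and highest log numbers among names like LOG3404.

    Assumption: each name is a bare file name; anything that is not
    LOG followed by exactly four digits is ignored.
    """
    numbers = []
    for name in names:
        match = re.fullmatch(r"LOG(\d{4})", name.strip().upper())
        if match:
            numbers.append(int(match.group(1)))
    if not numbers:
        raise ValueError("no LOG#### file names found")
    return min(numbers), max(numbers)


def logtool_commands(first_log: int, last_log: int) -> list[str]:
    """Fill a log file range into the two list commands from the article."""
    if first_log > last_log:
        raise ValueError("first_log must not exceed last_log")
    log_range = f"{first_log}/{last_log}"
    return [
        f'list log={log_range} type=111 "device class"="hard disc",'
        f'da,ca,"bus converter" out=LogOut1',
        f"list log={log_range} TYPE=111;'MGR CODE'= 241,242,900,901,951 "
        f"out=LogOut2",
    ]


if __name__ == "__main__":
    # Example file names are made up to match the range used above.
    first, last = log_range_from_names(["LOG3404", "LOG3410", "LOG3477"])
    for command in logtool_commands(first, last):
        print(command)
```

The printed lines would still need to be entered in LOGTOOL by hand or fed to a job stream. How Ranft's weekly job actually did this is not described in his note.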
At Byrne's shop the Series 918 was recovered after Adager did its repairs. "During the Adager repair an infrequently occurring high-pitched but low volume sound was noted," he said. "This appeared to emanate from the 3000. We have not heard it since the Adager repair completed. A suspicion arose that we might have a disk about to lose a bearing."
Once the recovery was complete and the disk replaced, the trademark wisecracks of long-time 3000 vets began to arise. The sound during Adager's repairs "would have been the sound of the chains being dragged around the disk to put them back straight," said Alan Yeo, "and possibly the sound of a very tiny virtual Alfredo welding a few broken links back together."
Byrne noted that the problems with the Series 918 disks "started when I was working on installing the 9.6 version of PostgreSQL on a new FreeBSD host. I wonder if the HP 3000 is throwing a temper tantrum? Naaah. Cannot be."