Ops check: does a replacement application do the same caliber of power fail recovery?
May 21, 2014
Migrating away from an HP 3000 application means leaving behind some things you can replace. One example is robust scheduling and job management. You can get that under Windows, if your target application will run on that Microsoft OS. It's extra, but worth it, especially if the app you need to replace generates a great many jobs. We've heard of one that used 14,000.
A migrating site will also want to be sure about error recovery in case of a system failure. What's a given in the 3000 world is the minimum bar to check on a new platform. This might not be an issue that app users care about -- until a brown-out takes down a server that doesn't have robust recovery. One HP 3000 system manager summed up the operations he needs to replace on HP's 3000 application server.
We're looking at recovery aspects if power is lost, or those that kick in whenever MPE crashes. On the 3000's critical applications, we can use DBCONTROL or FCONTROL to complete the I/O. Another option would be to store down the datasets before the batch process takes place.
A couple of decades ago, this was a feature where the 3000's IMAGE database stood out in a startling, visual way. A database shootout in New Jersey pitted IMAGE and MPE against Unix and Oracle, or second-tier entries such as Sybase or Informix. A tug on the power plug of the 3000 while it was processing data left the server in a no-data-loss state once it was rebooted. Not so much, way back then, for what we'd call today's replacement system databases.
Eloquence, the IMAGE workalike database, emulates this rock-solid recovery for any Windows or Linux applications that use that Marxmeier product. Whatever the replacement application will be for a mission-critical 3000 system, it needs to rely on the same caliber of crash or powerfail recovery. This isn't an obvious question to ask during the feature comparison phase of migration planning. But such recovery is not automatic on every platform that will take over for MPE.
On a journaling filesystem, it is possible for the write to the journal on disk to be delayed. From the current head position, it can be more efficient for the drive to write blocks in a different order from the one the operating system requested -- meaning data blocks can be committed before the journal is.
The way to resolve this is to make the operating system explicitly wait for the journal to have been committed before committing any more writes. This is known as a barrier. Most filesystems do not use this by default, so it must be enabled explicitly with a mount option.
mount -o barrier=1 /dev/sda /mntpnt
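Before changing anything, it can help to see what a volume is doing today. The following is a minimal sketch, not a definitive check: it assumes a Linux system with the util-linux findmnt command, reuses the /mntpnt mount point from the command above, and introduces a hypothetical helper named barrier_state. It also assumes the ext3/ext4 spellings of the option (barrier=1, barrier=0, nobarrier); other filesystems may name it differently or not report it at all.

```shell
# barrier_state OPTIONS
# Hypothetical helper: classify a comma-separated mount option string.
# Assumes ext3/ext4-style option names.
barrier_state() {
    case ",$1," in
        *,nobarrier,*|*,barrier=0,*) echo "disabled" ;;
        *,barrier,*|*,barrier=1,*)   echo "enabled" ;;
        *)                           echo "not listed" ;;
    esac
}

# Read the active options for the mount point used above.
# The result is empty if /mntpnt is not mounted or findmnt is unavailable.
opts=$(findmnt -no OPTIONS /mntpnt 2>/dev/null || true)
echo "barriers on /mntpnt: $(barrier_state "$opts")"
```

A result of "not listed" only means the option does not appear in the mount options; the filesystem's own default then applies, so the filesystem documentation is the place to confirm what that default is.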
The big downside to barriers is that they have a tendency to slow I/O down, sometimes dramatically (around 30 percent), which is why they aren't enabled by default.
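The ordering idea behind a barrier can also be illustrated at the application level. This is only a sketch of the principle, not how any particular filesystem or database implements it: the file names and the record format are made up for illustration, and it uses the standard sync command to ask the operating system to flush pending writes before moving on. Whether flushed data actually reaches the platters still depends on the drive's write cache and the filesystem settings discussed above.

```shell
#!/bin/sh
# Sketch: record the intent in a journal, flush it, and only then touch the data.
# All file names and record formats here are hypothetical.
set -eu

workdir=$(mktemp -d)
journal="$workdir/journal.log"
data="$workdir/data.txt"

# 1. Write the intended change to the journal first.
printf 'INTENT balance=100\n' >> "$journal"

# 2. Wait here until the OS has flushed pending writes.
#    This is the step that plays the role of the barrier.
sync

# 3. Only after the journal entry is flushed, apply the change to the data.
printf 'balance=100\n' > "$data"
sync

# 4. Mark the change as complete in the journal.
printf 'COMMIT\n' >> "$journal"
sync

echo "journal: $journal"
echo "data:    $data"
```

If power is lost between steps 1 and 4, a recovery pass can read the journal, find an INTENT with no matching COMMIT, and redo or discard that change. The sync calls are also where the cost shows up, since each one makes the script wait on the disk.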
In the 3000 world, logging has been used as a similar recovery feature, focused on recovering IMAGE data. A long-running debate included concerns about whether logging penalized application performance. We've run a logging article written by Robelle's Bob Green that's worth a look.
Peering under the covers of any replacement application, to see the means to recover its data, is a best practice. Even if a manager doesn't have deep knowledge of the target environment, this peering is the kind of thing the typical experienced 3000 manager will embrace without question. Then they'll ask the powerfail recovery question.