By Scott Hirsh
It’s sad but true: we system managers won’t cut ourselves any slack. We repeatedly put ourselves in jeopardy, often making the same mistakes time after time. We even break all the rules we impose on others. Don’t believe me? See if you recognize any of these examples.
1. Hand-crafted system management
Ah yes, the good old days. Peace, love and tear gas (I never inhaled). But here’s a news flash, sunshine: for system managers, the ’60s are dead. Predictable, repeatable tasks can and should be automated. If you can script it, you can schedule it. And if you can schedule it, you can automate it. So what are you waiting for? Do you like (take your pick): streaming jobs by hand; adjusting fences and priorities by hand; reading $STDLISTs; staring at the console waiting for that one important message? For this you went to college?
And yet, we (or our management) come up with lots of lame excuses for running a stone-age operation. Can’t afford the automation products, don’t trust automation, can’t trap every error, blah blah blah. Those excuses may fly when you’re small, but suddenly you have more systems and bigger systems, and manual management turns your shop into burn-out central. Now there are turnover costs, downtime costs, opportunity costs.
Oh, and by the way, it’s much more expensive to implement automated management in a large, busy environment than it is to grow automated management from a smaller environment. Perhaps some of us are just adrenaline junkies, or we fear not being needed. Get over it and automate already.
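To make “if you can script it, you can schedule it” concrete, here is a minimal sketch in Python of letting a script read job output so you don’t have to. The function name and the alert patterns are hypothetical, not taken from any HP or third-party product; substitute whatever your own jobs actually print when they fail.

```python
import re

# Hypothetical patterns; replace with the messages your own jobs emit on failure.
ALERT_PATTERNS = [r"\bERROR\b", r"\bABORT", r"\bFAILED\b"]


def scan_job_output(lines, patterns=ALERT_PATTERNS):
    """Return (line_number, text) for every line that matches an alert pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(rx.search(line) for rx in compiled):
            hits.append((number, line.rstrip()))
    return hits
```

Run something like this from a scheduled job over each night’s output and have it notify you only when the list of hits is non-empty. It won’t trap every error, but it beats staring at listings.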
2. The disappearing act
A close personal friend of mine — okay, it was me — once made a change to Security/3000’s SECURCON file, then left for an all-day meeting about 40 miles away. Guess what? None of the application users could log on after my change. Way back then, my pager almost vibrated off my belt from that one. And it made for some interesting meetings when I got back.
I have seen lots of cases where a system manager made a configuration change, installed a patch, or fussed with SYSSTART or UDCs, then immediately went home. Big mistake. If you’re lucky, you live near your data center and can zip right back to repair the carnage that was discovered right away. If you’re not lucky, first you don’t discover your mistake until the worst possible moment — say, around the heaviest usage period the next day — and then you’re forced to take the system down to fix the problem. Ouch.
3. A lack of planning on my part does constitute an emergency on your part
A variation on No. 1. We are the eternal optimists. No matter how invasive the procedure, everything will work out perfectly, right? How many PowerPatches must we install before we realize we must leave adequate time for testing the patched system and perhaps back that sucka out? No really, this time HP (or your favorite vendor) has learned from past mistakes and has a bullet-proof update. No need to leave a cushion for collateral damage. Right.
Every decent system administration book offers the same advice: Don’t do anything you can’t undo. Make a backup copy of whatever you’re changing. Keep track of the steps you followed. Be prepared to back out whatever you’re doing. Because that contingency time can inflate your update schedule by hours, it’s unlikely you can safely make a system change at any time other than weekends or holidays.
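The backup, log, and back-out routine can be sketched in a few lines of Python. This is an illustration under simple assumptions, not a complete change-control tool: the thing being changed is a single file, `change_with_backout` is a hypothetical name, and `apply_change` and `verify` are callables you supply yourself.

```python
import shutil
from pathlib import Path


def change_with_backout(path, apply_change, verify):
    """Back up a file, change it, and restore the backup if the change fails.

    Returns (succeeded, steps), where steps is a log of what was done.
    """
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # make the backup copy before touching anything
    steps = [f"backed up {path.name} to {backup.name}"]
    try:
        apply_change(path)
        steps.append("applied change")
        if not verify(path):
            raise ValueError("verification failed")
        steps.append("verified change")
        return True, steps
    except Exception as exc:
        shutil.copy2(backup, path)  # back out: put the original contents back
        steps.append(f"backed out change: {exc}")
        return False, steps
```

The design choice worth noting is that verification is part of the change, not an afterthought. If you can’t say how you will test the result, you aren’t ready to make the change.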
4. I’ve got a secret
You make changes but don’t tell anyone about them. Let’s be charitable and say your changes worked as planned. Unfortunately, nobody knew you were going to make the change. I have seen a change as innocuous as modifying the system prompt have unintended consequences (Reflection scripts looked for the old prompt and now wouldn’t work). The term “system” implies interrelationships. Anything we do has a ripple effect. When we don’t tell others that we’re about to make a change — “they wouldn’t let me do it if I told them!” — we don’t do ourselves any favors. I would love to hear other war stories under this category (hint, hint).
This probably explains all the peripherals you’ve bought that don’t work with your HP 3000. But isn’t the HP 3000 the most open system in the universe? A disk drive is a disk drive, right? The vendor told me the printer would work (and it costs much less than that HP printer). We do love our work, don’t we? And we do get excited by all the possibilities of the technology.
But sometimes — most times? — when the opportunity looks too good to be true, it is. And what a hassle it is when we’re stuck with a device, bought and paid for, that we must get to work with our system. Now. Because we’re out of space. Because the CFO doesn’t like spending $25K for a big paperweight.
Another aspect of this issue arises with replacement parts. No names please, but I have seen systems with non-certified disk drives. Sure they work — until there’s a power failure. The customer didn’t know they had this exposure because their maintenance company didn’t think it was worth mentioning. Do your homework, and watch out for little green men with maintenance kits.
And last, but not least, is taking “expert” information at face value. My first experience on the HP rack (running a Series 70) was with an HP SE who told me how to shortcut an OS update. It sounded good; I could use the extra time because I was updating on a Wednesday night (see No. 3). Before I knew it, I was staring at this message on the console: “Volume table destroyed, must reload.”
After that, I dropped SE support, figuring I was quite capable of destroying my system without high-priced assistance. If you don’t feel confident about what you’ve been told, post to the 3000-L and see what your peers have to say.
6. The odd couple
For every system management Oscar Madison, leaving old files around to clog up and slow down his system or creating his own collection of foo, temp, K or Q files, there is a Felix Unger counterpart out there, obsessively tidying up. Both personality types have been known to shoot themselves in the foot.
The slobs make their lives miserable by never archiving files, which eventually bites them when they run out of space and their backups take ever longer. They also suffer from having multiple versions of all kinds of things on disk, running the risk of executing the wrong version or accessing the wrong file. And of course there are performance and security penalties for a messy system.
But the fastidious system manager also has issues. For one thing, being too diligent about cleaning up can result in missing files. Here is a case where automation can be a negative. Jobs that run every so often, archiving files that haven’t been accessed for a certain amount of time, can wind up archiving a file just before you need it.
Or, in my case, I once archived a file in the VESOFT account that hadn’t been accessed in years, only to discover it was some kind of special file that had to be there, even though it was never accessed (go figure).
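One way to keep an archiving job from eating a file like that is to have it report candidates first and honor a keep list. The Python sketch below is a dry run only: it lists files and moves nothing. The name `archive_candidates` and the `keep` parameter are hypothetical, and it assumes the file system actually records last-access times, which some are configured not to do.

```python
import os
import time
from pathlib import Path


def archive_candidates(root, max_idle_days, keep=(), now=None):
    """List files under root not accessed for max_idle_days, skipping names in keep."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86400 seconds per day
    keep = set(keep)
    candidates = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.name in keep:
            continue  # directories and protected files are never candidates
        if path.stat().st_atime < cutoff:
            candidates.append(path)
    return candidates
```

Review the list before anything is archived, and add the “never accessed but must be there” files to the keep list as you discover them.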
Yes, it’s still good to be conscientious about keeping your system tidy. Just don’t overdo it.
You deserve a break today
If we can just step back and catch ourselves in dysfunctional behavior, we can start giving ourselves a break. We should not need to carry a pager, cell phone and laptop with us on vacation — for those brave enough to take a vacation, that is. We should not spend most of our time while out on the road on our phones, explaining how to recover our systems or where critical files are hidden. We should not expect to get raises when we spend so much of our professional time performing tasks that an entry-level employee can handle. By cleaning up our acts, we can stop reacting to self-inflicted busy work, which will free up time for more important tasks — like reading the NewsWire.