By Scott Hirsh
It’s sad but true: we system managers won’t cut ourselves any slack. We repeatedly put ourselves in jeopardy, often making the same mistakes time after time. We even break all the rules we impose on others. Don’t believe me? See if you recognize any of these examples.
1. Hand-crafted system management
Ah yes, the good old days. Peace, love and tear gas (I never inhaled). But here’s a news flash, sunshine: for system managers, the ’60s are dead. Predictable, repeatable tasks can and should be automated. If you can script it, you can schedule it. And if you can schedule it, you can automate it. So what are you waiting for? Do you like (take your pick): streaming jobs by hand; adjusting fences and priorities by hand; reading $STDLISTs; staring at the console waiting for that one important message? For this you went to college?
And yet, we (or our management) come up with lots of lame excuses for running a stone-age operation. Can’t afford the automation products, don’t trust automation, can’t trap every error, blah blah blah. Those excuses may fly when you’re small, but suddenly you have more systems and bigger systems, and manual management turns your shop into burn-out central. Now there are turnover costs, downtime costs, opportunity costs.
Oh, and by the way, it’s much more expensive to implement automated management in a large, busy environment than it is to grow automated management from a smaller environment. Perhaps some of us are just adrenaline junkies, or we fear not being needed. Get over it and automate already.
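To make "if you can script it, you can schedule it" concrete, here is a minimal sketch of one chore from the list above: scanning a job listing for trouble instead of reading it by eye. It is written in Python purely for illustration. The error patterns, the function name find_problem_lines and the sample listing are all assumptions, not the output format of any real job system, so expect to tune them for your own shop.

```python
import re

# Hypothetical patterns; real job listings need site-specific ones.
ERROR_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"\berror\b", r"\babort(ed)?\b", r"\bfail(ed|ure)?\b")
]


def find_problem_lines(listing_text):
    """Return (line_number, line) pairs that match any error pattern."""
    hits = []
    for number, line in enumerate(listing_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up listing text, used only to exercise the function.
    sample = (
        "JOB NIGHTLY STARTED\n"
        "step 1 ok\n"
        "PROGRAM ABORTED in step 2\n"
        "END OF JOB\n"
    )
    for number, line in find_problem_lines(sample):
        print(f"line {number}: {line}")
```

Once a check like this exists as a script, any scheduler can run it after every job, and a human only gets involved when it finds something.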
2. The disappearing act
A close personal friend of mine — okay, it was me — once made a change to Security/3000’s SECURCON file, then left for an all-day meeting about 40 miles away. Guess what? None of the application users could log on after my change. Way back then, my pager almost vibrated off my belt from that one. And it made for some interesting meetings when I got back.
I have seen lots of cases where a system manager made a configuration change, installed a patch, or fussed with SYSSTART or UDCs, then immediately went home. Big mistake. If you’re lucky, you live near your data center and can zip right back to repair the carnage that was discovered right away. If you’re not lucky, first you don’t discover your mistake until the worst possible moment — say, around the heaviest usage period the next day — and then you’re forced to take the system down to fix the problem. Ouch.
3. A lack of planning on my part does constitute an emergency on your part
A variation on No. 1. We are the eternal optimists. No matter how invasive the procedure, everything will work out perfectly, right? How many PowerPatches must we install before we realize we must leave adequate time for testing the patched system and perhaps back that sucka out? No really, this time HP (or your favorite vendor) has learned from past mistakes and has a bullet-proof update. No need to leave a cushion for collateral damage. Right.
Every decent system administration book offers the same advice: Don’t do anything you can’t undo. Make a backup copy of whatever you’re changing. Keep track of the steps you followed. Be prepared to back out whatever you’re doing. Because that contingency time can inflate your update schedule by hours, it’s unlikely you can safely make a system change at any time other than weekends or holidays.
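The advice above (back up what you change, keep track of your steps, be ready to back out) is easy to turn into a habit if a script does it for you. What follows is a rough Python sketch under stated assumptions: the helper names backup_file, apply_change and back_out are invented for illustration, the "change" is just rewriting a text file, and the step log is a plain list. A real change procedure would have to cover far more than one file.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_file(path, log):
    """Copy path to a timestamped sibling file and record the step."""
    path = Path(path)
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    backup = path.with_name(f"{path.name}.{stamp}.bak")
    shutil.copy2(path, backup)
    log.append(f"backed up {path} to {backup}")
    return backup


def apply_change(path, new_text, log):
    """Back up the file first, then overwrite it with new_text."""
    backup = backup_file(path, log)
    Path(path).write_text(new_text)
    log.append(f"rewrote {path}")
    return backup


def back_out(path, backup, log):
    """Restore the file from the backup taken before the change."""
    shutil.copy2(backup, path)
    log.append(f"restored {path} from {backup}")


if __name__ == "__main__":
    # Demonstrate on a throwaway file in a temporary directory.
    with tempfile.TemporaryDirectory() as work:
        config = Path(work) / "settings.conf"
        config.write_text("prompt=:\n")
        steps = []
        saved = apply_change(config, "prompt=>\n", steps)
        back_out(config, saved, steps)
        print("\n".join(steps))
```

The point of the design is that the backup and the log entry happen inside the same call as the change, so you cannot make the change without also being able to undo it.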
4. I’ve got a secret
You make changes but don’t tell anyone about them. Let’s be charitable and say your changes worked as planned. Unfortunately, nobody knew you were going to make the change. I have seen a change as innocuous as modifying the system prompt have unintended consequences (Reflection scripts looked for the old prompt and now wouldn’t work). The term “system” implies interrelationships. Anything we do has a ripple effect. When we don’t tell others that we’re about to make a change — “they wouldn’t let me do it if I told them!” — we don’t do ourselves any favors. I would love to hear other war stories under this category (hint, hint).
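The prompt story suggests a cheap check to run before you announce a change: search your scripts for the thing you are about to alter. The Python sketch below is only an illustration. The function name find_dependents, the directory layout and the sample script contents are assumptions, and a plain substring search will miss anything that builds the string dynamically.

```python
import tempfile
from pathlib import Path


def find_dependents(root, needle):
    """Return (file_name, line_number) for every line containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Ignore undecodable bytes so binary files do not stop the scan.
        text = path.read_text(errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.name, number))
    return hits


if __name__ == "__main__":
    # Made-up script files, used only to exercise the search.
    with tempfile.TemporaryDirectory() as scripts:
        Path(scripts, "logon.txt").write_text('send "HELLO"\nwait for "OLDPROMPT:"\n')
        Path(scripts, "report.txt").write_text('send "RUN REPORT"\n')
        for name, number in find_dependents(scripts, "OLDPROMPT:"):
            print(f"{name} line {number} depends on the old prompt")
```

A list of hits is also a ready-made list of people to tell before you make the change.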