There are a lot of technology options out there, and many of them are free/open source. There are so many, in fact, that it’s tempting to install too much. Having too much technology can be just as bad as having too little, and “free” can become pretty costly. To be clear, I’m not knocking free/open source software, just the misapplication of it.
First and foremost, the more software/hardware you have, the more likely it is that some of it will have a bug. That’s just the law of averages coupled with the fact that no significant software project is really bug-free.
Then there’s the maintenance effort. The more technology you have, the more effort needs to go into the care and feeding of it, and the more you have to learn about.
Lastly, there’s timing. Agile teaches us to delay decision-making and development as late as possible, because that’s the point where we know the most about what’s needed. In the same way, the more technology you put in place before you need it, the harder you make it to implement what you really need when the time comes.
Where am I going with this? About two weeks ago, I had a bashful server that decided not to start up again after I shut it down for maintenance. This server handled my email, many of my websites, my MythTV recordings, and some other functions. I had other issues with it that were equally mysterious. Having just made an unrelated, very expensive purchase, I couldn’t just run out and drop five bills or more on new boxes. Even if I had, that would have saved time, but I would still have had largely the same reconstruction issue (albeit on more powerful, known-working hardware). While it took much longer than I would have liked to get up and running again because I was doing it late at night, it could have been much worse. Keeping it simple is what helped me.
- As a Software Engineer, not a Systems Administrator, I’ve never fully understood all the options available for RAID and LVM. I knew that if I had a crash without knowing a good deal about them beforehand, I would really be sunk, so all five hard drives in the system were JBOD. That means I can simply shove them in another box to look at them. For a home server that’s heavily used but doesn’t contain business data, I felt comfortable not having the redundancy that some levels of RAID offer.
- My backups are gzipped tar files copied to external USB hard drives (actually two: one at home and one somewhere else for an encrypted offsite backup). Those backup drives are only live when I back up, so they should last a good long time. Again, I can walk up to any other computer and extract or inspect my backups. No need to get a baseline system up with just enough to run a restore program.
- I don’t have services running in VMs for isolation (though some are in chroot jails).
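To make the backup bullet concrete, here is a minimal sketch of the "gzipped tar file" idea using nothing but Python's standard `tarfile` module. The function names `make_backup` and `list_backup`, and the `mail` directory in the demo, are made up for illustration; this is not the script I actually run, and it leaves out encryption and the copy to the USB drive.

```python
import tarfile
import tempfile
from pathlib import Path


def make_backup(source_dir: Path, archive_path: Path) -> None:
    """Write source_dir into a gzipped tar file at archive_path."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores paths relative to the directory's own name,
        # not the full absolute path on this machine.
        archive.add(source_dir, arcname=source_dir.name)


def list_backup(archive_path: Path) -> list[str]:
    """Return the names of the regular files stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


if __name__ == "__main__":
    # Hypothetical demo data in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "mail"
        source.mkdir()
        (source / "inbox.txt").write_text("hello\n")
        archive_file = Path(tmp) / "mail-backup.tar.gz"
        make_backup(source, archive_file)
        print(list_backup(archive_file))
```

The point is less the code than the format: any machine with tar or Python on it can open the result, so there is no restore program to bootstrap first.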
Obviously I don’t recommend these specific practices for anyone who can take the time to learn and use tools like RAID, LVM, and VMs properly, or who needs guaranteed uptime, but that’s kinda my point. I’ve implemented what I need in the simplest way so that it’s easy to understand, maintain, and fix.
The same principle goes for companies, too. I have a friend who works as a Release Engineer at a company with such a complex tool chain that cutting a release and preparing for developers to work on the next one can sometimes take two weeks! Most of that delay is because they’ve made things too flexible. To support mixing and matching components in ways that have never been needed and aren’t predicted to be needed, there are too many levels of indirection and too much separation, so they can’t just say “OK, revision 32274 in the source control system is release 4.0. Let’s branch it... Done!”
A company I worked for a long time ago was effectively held hostage by the one person who knew and controlled the release process, until someone cracked the code and ended their Job Security Through Obscurity. It was largely a manual process fraught with danger zones. The system put in place afterwards took much less time, and was much more repeatable, because it was done using straightforward scripts instead of magic incantations.
Another group I know of has had two large server crashes, and both times the last good backup was ancient. The backup system was a complex mix of tarring files and transferring them from one machine to another, with no verification process. You might argue that the system was not complex enough because it didn’t verify the backups, but I maintain that it was too complex because that made verification hard.
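When the backup is just a tar file, verification can be small enough that there’s no excuse to skip it. The following is a rough sketch, not what that group (or anyone in particular) runs: the helper names `sha256_of` and `verify_backup` are hypothetical, it assumes the archive was created with paths relative to the source directory’s name, and it only checks regular files.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(stream) -> str:
    """Hash a binary file-like object in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(65536), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, archive_path: Path) -> list[str]:
    """Return a list of problems; an empty list means the backup matches."""
    archived = {}
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                archived[member.name] = sha256_of(archive.extractfile(member))

    problems = []
    for path in sorted(source_dir.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue  # this sketch only compares regular files
        # Assumes archive names look like "<source dir name>/<relative path>".
        name = path.relative_to(source_dir.parent).as_posix()
        if name not in archived:
            problems.append(f"missing from backup: {name}")
            continue
        with path.open("rb") as handle:
            if sha256_of(handle) != archived[name]:
                problems.append(f"content differs: {name}")
    return problems


if __name__ == "__main__":
    # Hypothetical demo data in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "site"
        source.mkdir()
        (source / "index.html").write_text("<h1>hi</h1>\n")
        archive_file = Path(tmp) / "site.tar.gz"
        with tarfile.open(archive_file, "w:gz") as archive:
            archive.add(source, arcname=source.name)
        print(verify_backup(source, archive_file))  # checked right after backing up
        (source / "index.html").write_text("<h1>changed</h1>\n")
        print(verify_backup(source, archive_file))  # checked after a later edit
```

Run right after a backup, an empty list tells you the archive is readable and complete. Run against an old archive, the list tells you how stale it is.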
The take-away from all of this is to Keep IT Systems Simple. Learn from Agile and don’t put things in place before you need them. Learn from POSIX and build up functionality from independently verifiable components that can be replaced as needed. Learn from TDD and don’t put a system in place until you have a way of telling whether it’s working or not.