In Part 1, I described a method of documentation in which the documentation itself is used to introduce the system. This builds consensus, enculturates an operations group, and provides a platform onto which more automation can be built. In Part 2, I elaborated on the ideas of Bootstrapping & Rooting, Self Service Culture, and Selfish Documentation.
This is not an argument for documentation over automation, but more about fostering culture and the necessary human/system interactions. Even automation needs to be documented, especially from an architectural perspective. Drawings may replace individual commands. Background descriptions and use cases may replace configuration syntax, but documentation is still necessary. I used the Gentoo Handbook as an example of this kind of documentation.
The seeds of this came to me several years ago while working in a fairly large web operation with about 1500 servers and 6 systems administrators. Later, I brought the techniques I had developed to my current employer and implemented them with 2 systems administrators managing about 100 servers, the caveat being that we also manage the network and all of the physical infrastructure. About a year ago, our web operations shrank a bit (primarily because of automation) and I now manage these 100 servers, network devices, security, power, HVAC, and all associated infrastructure (DNS, mail, etc.) all by my lonesome. We have a highly automated environment, but still require documentation. With such a wide array of technology, the metadata is the key.
Our wiki includes documentation on about 130 different pieces of software, 70 customers, 100 servers, 30 network devices, 5 realms of physical infrastructure, security, and weekly, monthly, quarterly, semiannual, and annual checklists. The software ranges from Firefox extensions to caching strategies for Apache when serving PDF files to Internet Explorer 6. This should give an impression of how necessary documentation is when I must divide my time between so many different pieces of infrastructure.
Real World Example
To create this real world example, I searched our internal wiki for the perfect specimen. I looked for an item that had started as a small project and eventually grew into something that required extensive automation. There are dozens of examples, but the best one I could find was our backup system. When I first came here three years ago, the backup system was based on Legato and had documentation tucked away in a text file somewhere on the systems administrator's desktop. This breaks all of the tenets of Bootstrapping & Rooting, Self Service Culture, and Selfish Documentation. If the systems administrator had moved on, it would have been nearly useless for anyone else.
The first thing I did was fire up a wiki. Since I had no experience with Legato, the second thing I did was capture all of the basic commands. This at least got me self-sufficient. This is where I left it for a time, and it served me well.
About a year later, a project came down the pike to upgrade the backup system. We worked from a ticket, evaluated several different pieces of software, and captured best practices from the Internet. After our research phase, we determined that our backup system would consist of a disk backup which would get migrated to tape, using Bacula & rsync. This would require some custom software to rsync the individual servers to the backup server, which would eventually push to tape. As many of you know, there are a million ways to do this. On top of that, this was only the tip of the iceberg: some clients even required off-site DVD backups. We experimented quite a bit before nailing it all.
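To make the disk stage concrete, here is a minimal sketch of what "rsync the individual servers to the backup server" might look like. It is not our production code: the host names, the `/etc/` source path, and the `/backup` root are hypothetical placeholders. It only builds the commands so they can be reviewed before anything is run.

```python
from pathlib import PurePosixPath


def build_rsync_command(host, source_path, backup_root):
    """Return the rsync argv that mirrors one server's path onto the backup disk."""
    # Each server gets its own directory under the (hypothetical) backup root.
    destination = PurePosixPath(backup_root) / host
    return [
        "rsync",
        "-a",        # archive mode: preserve permissions, times, and symlinks
        "--delete",  # drop files from the mirror that were removed on the source
        f"{host}:{source_path}",
        f"{destination}/",
    ]


def plan_backups(hosts, source_path="/etc/", backup_root="/backup"):
    """Build one rsync command per server; nothing is executed here."""
    return [build_rsync_command(host, source_path, backup_root) for host in hosts]


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for command in plan_backups(["web01", "web02"]):
        print(" ".join(command))
```

Keeping the planning step separate from execution leaves room for a human to inspect the commands. The push from disk to tape would be handled separately by Bacula and is not shown.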
Some time later, we had a working homegrown prototype, and then we went into production with all of the associated testing and checklists (read: restore tests). All the while, we were tracking our architectural changes in the wiki and capturing routine operations, special operations, and so on. Over the next two years, more and more of the system was automated as necessary. Now, literally the entire system is automated except for restores. Even with a fully automated system, the documentation bootstraps the manual processes (such as changing code).
Since most people are familiar with MediaWiki, I will start with the index. Notice how lightweight the main headings are: background, architecture, routine operations, special operations, installation, and archives. Even though the system is fully automated, notice from the index how metadata & qualitative actions are captured. Even contingencies for a failed backup are noted. Automation cannot possibly capture these things; they must be decided by human beings and written down. It is important to reach a zen state of automation and documentation, where they are perfectly balanced. Qualitative openings are left in the automation so that human beings can make decisions where necessary.
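For anyone who wants to start a page with the same shape, here is a small sketch that prints an empty skeleton using those six headings in MediaWiki's `== Heading ==` syntax. The function name and the choice of level-two headings are my assumptions for illustration; our real page was written by hand.

```python
# Main headings taken from the backup system's wiki index.
SECTIONS = [
    "Background",
    "Architecture",
    "Routine Operations",
    "Special Operations",
    "Installation",
    "Archives",
]


def wiki_skeleton(sections=SECTIONS):
    """Return MediaWiki markup containing one empty level-two heading per section."""
    return "\n\n".join(f"== {name} ==" for name in sections) + "\n"


if __name__ == "__main__":
    print(wiki_skeleton())
```

The output can be pasted into a new wiki page and filled in as the system grows.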
Next, there is an architectural drawing. Notice the level at which it is drawn: it has useful information while purposefully leaving out excessive detail. When I expand upon the code to satisfy business rules for customers, I use this drawing to dig back into the architecture. This document will most likely only need to be changed at the end of the production life cycle. I would suspect it will stay accurate for 3 to 7 years.
To expand upon the drawing, there is a bit of text in the wiki which describes some of the more important subsystems. This text can be updated as business rules change, leaving the drawing intact and unchanged. While none of these scripts are run manually on a daily basis, the documentation of their reasons for existence is useful when going back and modifying code. This kind of documentation can be much more important than API documentation, which in this case would be command switches.
Finally, notice how the qualitative hooks into our automated system are thoughtfully documented. Notice the following text: “Most EYEMG employees”, which promotes self service within our group. These qualitative hooks are left in place to allow human beings to interact with the system where it makes sense. Though quite obvious with a backup system, this can be quite difficult to ascertain with other systems such as installation scripts (especially during development cycles).
Realistic Hypothetical Example
As a much shorter hypothetical example, let’s say that tomorrow, one of my developers wants to implement a RabbitMQ instance, with which I have no experience. First, I will create a ticket, begin my research, and finally boil down the knowledge I have gained into the wiki. I will then start to move from a qualitative approximation to a quantitative process by automating the installation and integration into our environment. When I am finished, I will be left with a documented & automated install/deploy which I can troubleshoot using all that haphazard knowledge I learned and captured in the wiki. Honestly, it’s good stuff.
As an extension, let’s say that later I have to start troubleshooting problems. The rarely used commands from the build out will already be documented, which gives me a place to start. There is a deeply satisfying balance between quantitative and qualitative workflow which can be achieved with automation and documentation.
I hope I have been able to demonstrate a slight paradigm shift in documentation. The focus is, perhaps, on scaling one's time and energy and finding the perfect balance between automation, documentation, and fun.