Scripting and automation have been goals since the beginning of Unix, and I believe a fully automated provisioning system is achievable in our production environments. In fact, I think developing fully automated provisioning is essential to keep up with the rate of change that business demands these days. But I also believe automation is not the entire scope of our work in operational IT. To address this, I would like to describe a common workflow I have identified when developing new systems and putting them into production.
I define qualitative actions as those completed manually, requiring many decisions, while quantitative actions are completed with programming or scripting. Depending on the scope we take when investigating an IT system, it can be composed of one or both. When thinking in these terms, it is critical to remember that every quantitative solution leads back to a qualitative decision to automate. In fact, all quantitative solutions begin and end in qualitative decisions and analysis.
In my experience, there has been a natural progression from qualitative to quantitative in almost every project I have worked on, and the same has been true in the maintenance phase. When first deploying a technology there is a learning phase. Then comes a development phase, where the new technology is used to create something useful to the business. Finally, the technology lives out its maintenance life cycle in a production environment. At each of these phases, critical qualitative and quantitative inputs keep the project or system on course.
Having these tools in place is critical when trying to automate an operations environment.
- Ticket System: This applies to a scope much larger than just automation. A ticket system helps everyone track the progress of automation projects, which is critical because operations teams experience constant interruptions.
- Wiki: Wikis are also useful for tracking the progress of complex projects, but they shine at what I call human scripting: capturing instructions and commands that are easy to follow and easy to edit.
- Programming Language: Bash is essential for basic scripts and automation that requires running droves of commands. Python, Ruby, or Perl is essential for writing scripts that need data structures such as key/value pairs, and for interacting with vendor code, such as Red Hat's installer scripts.
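As a small illustration of why a language with real data structures earns its place in this toolbox, here is a minimal sketch (my own example, not from any particular vendor) of parsing a key=value config file into a dict, the kind of task that is painful in pure Bash:

```python
# Hypothetical example: turn a simple key=value config file into a dict.
def parse_config(text):
    """Return a dict of key/value pairs, skipping blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

sample = """
# server settings
hostname = web01
port = 8080
"""
print(parse_config(sample))  # {'hostname': 'web01', 'port': '8080'}
```

Once the values live in a dict, the rest of the script can make decisions on them directly instead of re-grepping a file at every step.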
I define qualitative actions as those made by a person: the decisions and inputs of a programmer or systems administrator. Fixing a problem manually or holding a meeting to determine the best architecture are good examples. If the scope is big enough, all actions are qualitative; a systems administrator running a script with chosen command line options, for example.
At the beginning of every new technology project there is a qualitative phase, and there are risks associated with automating during it. If automation is completed too early, you may waste time developing code for something that never comes to fruition. There is also diminished return from automation at this phase, because the work will only occur once.
In the maintenance phase it is not always self-evident what should and should not be automated. The goal is to automate everything that is more efficient to automate, and no more. This can be tough for beginning programmers and systems administrators to judge.
Qualitative Use Cases
One example is evaluating a new monitoring system. If a team is evaluating a new monitoring system, it would be inefficient to implement automated deployment until the system has been selected. This is true of any technology selection process. It may be considered self-evident, but I enumerate it for completeness.
Another commonly overlooked example is the writing of the code itself. It is a qualitative endeavor that involves a great deal of engineering, choice, and opinion.
Another case is the "one of a kind server" that is extremely complex, such as a pair of KVM servers running DRBD/GFS/Red Hat Cluster with bonded Ethernet interfaces on the back and bridged networking on the front for VLAN support. The installation and configuration of a cluster like this is a perfect candidate for human scripting, which might take several hours; it is not a candidate for automated scripting, which might take 200 hours of work to do reliably. Nonetheless, once this cluster is built it would still be useful to control its ntp.conf file with some kind of deployment automation.
Development environments for programmers, designers, and systems administrators are by their very nature qualitative. They are often one of a kind systems with no programmatic description of their configuration or software installation; indeed, to create one would slow creativity. This exception is not a problem: it is the efficient economic choice, because the investment in automation would never pay returns. That is not to say automation shouldn't be used where it is easily applied (OS installation, for example), but extra time should not be devoted to automating the installation of systems like these.
Sometimes operations will decide that something is too hard to automate and spend exorbitant amounts of time manually handling a problem. Log analysis and security are perfect examples of unlimited want and limited resources. In these scenarios, automation must be employed to achieve the goals.
Another pitfall is standardization. Sometimes standardization will hold back creative solutions to problems. Once, I developed an automated Apache virtual host build script for a service provider. At first the team was apprehensive about deployment because the script relied on home grown environment variables that were unique to each server, and some of the team didn't like that dependency. In the end, I built testing into the script so that it exited if the variables weren't found. The systems administrator would then manually (qualitatively) add these unique environment variables, a finite cost per server. The script turned an hour-long job into one minute of answering questions. It took approximately 8 hours of coding and saved nearly a hundred hours per year for 5 years: a net gain of roughly 490 hours, or about 12 weeks of work.
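The guard described above can be sketched in a few lines. This is a hypothetical reconstruction, not the original script, and the variable names (VHOST_DOCROOT, VHOST_IP) are invented for illustration:

```python
import os
import sys

# Hypothetical required variables; the real script used site-specific names.
REQUIRED_VARS = ["VHOST_DOCROOT", "VHOST_IP"]

def check_required_vars(environ=None):
    """Return the list of required variables missing from the environment."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if name not in environ]

def guard_or_exit():
    """Bail out before the build starts if anything is missing."""
    missing = check_required_vars()
    if missing:
        sys.stderr.write("Missing environment variables: %s\n" % ", ".join(missing))
        sys.exit(1)
```

Exiting early with a clear message keeps the qualitative step (a human setting the per-server variables) cleanly separated from the quantitative one (the build itself).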
Another risk is entropy. When several different people do things in different ways, it becomes very difficult to automate, and this can become a self-fulfilling prophecy.
Human scripting and automated scripting are quantitative in nature. The initial construction can be daunting, and it can be difficult to make a cost/benefit decision up front. Automated solutions are critical for repetitive tasks, and once they are deployed, testing becomes absolutely necessary: both build time testing and run time testing (monitoring) are needed.
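The two kinds of testing can be sketched side by side. This is a minimal illustration with invented names: a build time check that rejects a bad generated config before it ships, and a run time check of the sort a monitoring probe would poll:

```python
def build_time_check(config):
    """Build time: reject a structurally wrong config before deployment."""
    errors = []
    if "hostname" not in config:
        errors.append("missing hostname")
    if not str(config.get("port", "")).isdigit():
        errors.append("port must be numeric")
    return errors

def run_time_check(uptime_seconds, error_rate):
    """Run time: return 'OK' or 'CRITICAL', Nagios-style."""
    if uptime_seconds < 60 or error_rate > 0.05:
        return "CRITICAL"
    return "OK"
```

The point is that each layer catches a different failure: the first stops a broken artifact from reaching production at all, the second notices when a once-good system drifts into a fault state.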
Automation is similar to the sales funnel: there are many leads, but only some convert to opportunities. The same is true in developing automation. Some candidates are good customers; others are not.
The natural flow of automation starts with a completely manual system, where a systems administrator or programmer must remember how to make a change. Next comes a well documented system with instructions. The next step I call human scripting, a technique I learned from performing maintenance actions: document every command that will be run in a wiki and act them out as an actor would; the nice part is you don't have to remember your lines. There may be some customization while running the commands, but it generally produces consistent, time-saving results. Finally, full automation is achieved in code: the well practiced human script is translated into code and each step is tested automatically. That will be the subject of a future post and software release called scriptlog.
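A minimal sketch of that final translation, in the spirit of the scriptlog idea (this is my own illustration, not the scriptlog software itself): each wiki step becomes a command paired with a test, and the run stops at the first step whose test fails.

```python
import subprocess

# Each step from the wiki becomes (description, command, test on its output).
# These two steps are placeholder examples.
STEPS = [
    ("kernel name", ["uname", "-s"], lambda out: out.strip() != ""),
    ("echo test", ["echo", "step complete"], lambda out: "step complete" in out),
]

def run_steps(steps):
    """Run each step in order; stop and report at the first failed test."""
    for description, command, test in steps:
        out = subprocess.run(command, capture_output=True, text=True).stdout
        if not test(out):
            print("FAILED: %s" % description)
            return False
        print("ok: %s" % description)
    return True
```

The structure mirrors the human script exactly, which makes it easy to verify the code against the wiki page it replaced.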
Use cases for automation, in general, seem to require less justification.
Quantitative Use Cases
A common example is testing. Continuous integration (http://en.wikipedia.org/wiki/Continuous_integration) is an excellent use of automation in the project phase.
Installation and configuration to a baseline is another hugely efficient use of automation. When there are a large number of homogeneous systems, this kind of automation pays dividends.
Deployment of application code, configuration, and content. This can be taken as far as continuous deployment (http://timothyfitz.wordpress.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/), but if you are doing any kind of software development in house, this is important to automate.
Configuration build tools. When building configurations for DNS, Nagios, or firewalls, automating configuration file builds is essential. A perfect example is the Apache virtual host build tool mentioned before.
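A configuration build tool can be as simple as a template plus a list. Here is a hypothetical sketch that renders Nagios-style host definitions from (name, address) pairs; the field layout is illustrative, not a complete Nagios object definition:

```python
# Template for one host block; doubled braces are literal in the output.
HOST_TEMPLATE = """define host {{
    host_name  {name}
    address    {address}
}}
"""

def build_nagios_config(hosts):
    """Render one host block per (name, address) pair."""
    return "\n".join(
        HOST_TEMPLATE.format(name=name, address=address)
        for name, address in hosts
    )

print(build_nagios_config([("web01", "10.0.0.1"), ("web02", "10.0.0.2")]))
```

With the host list in one place, adding a server to monitoring becomes a one-line change instead of hand-editing a config file.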
Automated systems that are used only once every 3 months become difficult to trust and use, especially when the underlying infrastructure has changed. They will be in a fault state more often than not, and the cost of maintenance will outweigh the time saved. A qualitative solution becomes higher quality and more efficient in this scenario. Once, I wrote a Red Hat Linux Cluster deployment script. At the end of two years I had nearly 100 hours in its coding, testing, and deployment. It saved about an hour per installation, and we built 5 clusters in two years: a net loss of 95 hours. I made this bad decision because I thought the installation process was fraught with perilous caveats that would be better expressed in code. In retrospect, our semi-automated human scripting in the wiki was a much better solution. On top of that, the automated system was honestly scary to use after 3 to 6 months of disuse.
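The cost/benefit arithmetic behind both of these stories fits in one function. The run counts below are rough figures taken from the anecdotes (roughly 500 hour-long jobs over five years for the Apache script, 5 cluster builds for the cluster script):

```python
def net_hours(build_hours, hours_saved_per_run, runs):
    """Hours gained (positive) or lost (negative) by automating."""
    return hours_saved_per_run * runs - build_hours

# Cluster script: 100 hours to build, 1 hour saved on each of 5 installs.
print(net_hours(100, 1, 5))    # -95: a net loss

# Apache virtual host script: ~8 hours to build, ~1 hour saved per run.
print(net_hours(8, 1, 500))    # 492: a clear win
```

Running the numbers before writing the code is itself a qualitative decision, and it is the cheapest test of all.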
Deployment of data can be another caveat. Think of the problem wordpress.com has, where people are constantly deploying new content. If upgrades are made to the blog post class or data structure, there is no easy way to revert unless it is built into the architecture.
- The most important factor in automation is scope. Some systems will be completely automated at one scope but rely on qualitative input at a higher level.
- It is natural to flow from qualitative to quantitative solutions as demand grows in certain areas of operations.
- Counterintuitively, it is also natural for automated systems to fall into disuse. It is difficult to determine when removing automation is more efficient than upkeep.
- This is analogous to Artificial Ignorance: strive to automate all known, typical paths, and prioritize automation by frequency of occurrence.
- Stop automating when the law of diminishing returns applies. Do not plan too much for unknowns; they are by their nature inefficient to plan for, and it is more efficient to handle them as they happen.
- Systems administrators, more so than programmers, run the risk of leaving too many things to manual resolution.
- Programmers are often better at producing quantitative solutions, but the danger is overuse, or "play syndrome." Do not automate just because it is fun (though it is).
Please leave me a comment and let me know your thoughts on when and how to automate.