<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Crunch Tools</title>
	<atom:link href="http://crunchtools.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://crunchtools.com</link>
	<description></description>
	<lastBuildDate>Wed, 11 Jan 2012 20:57:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
		<item>
		<title>Pittsburgh Perl Workshop</title>
		<link>http://crunchtools.com/pittsburgh-perl-workshop/</link>
		<comments>http://crunchtools.com/pittsburgh-perl-workshop/#comments</comments>
		<pubDate>Sun, 09 Oct 2011 17:22:26 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Event]]></category>

		<guid isPermaLink="false">http://crunchtools.com/pittsburgh-perl-workshop/</guid>
		<description><![CDATA[Originally inspired by similar workshops in Europe, The Pittsburgh Perl Workshop was established in 2006 as a low-cost technical conference for users of the Perl Programming Language. The conference emphasizes real code and immediate, practical solutions to common issues.]]></description>
			<content:encoded><![CDATA[<span id="Background"><h2>Background</h2></span>
<p>From the organizers:</p>
<blockquote><p>Originally inspired by similar workshops in Europe, The Pittsburgh Perl Workshop was established in 2006 as a low-cost technical conference for users of the Perl Programming Language. The conference emphasizes real code and immediate, practical solutions to common issues.</p></blockquote>
<p>This year <a href="http://www.allgoodbits.org/">Duncan Hutty</a> was kind enough to invite me to speak at the Pittsburgh Perl Workshop in the &#8220;Ops Track&#8221;. I gave a talk on how to <a href="http://crunchtools.com/rubust-monitoring/">Design a Robust Monitoring System</a>. Most of the material in the talk came from my experiences at <a href="http://www.eyemg.com/">EYEMG</a> and <a href="http://www.americangreetings.com/">American Greetings Interactive</a>, which, when I worked there, was a wholly owned subsidiary of American Greetings that ran all of its web properties.</p>
<p>The keynote was given by <a href="http://everythingsysadmin.com/">Thomas Limoncelli</a>, who wrote <a href="http://www.amazon.com/o/ASIN/0201702711/tomontime-20">The Practice of System and Network Administration</a>. During his talk, I was pleased to pick up that Google had developed a similar scheme for breaking down events into three categories: Record, Action, and Critical Action. For more info, check out the <a href="http://crunchtools.com/rubust-monitoring/">talk</a>.</p>
<p>It was well worth the beautiful two-hour drive, and I hope to see more people at next year&#8217;s conference.</p>
<p>&nbsp;</p>
<span id="Schedule"><h2>Schedule</h2></span>
<p>Room Opens 6:30pm, Dinner 7-8pm, presentation 8-9pm</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/pittsburgh-perl-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Designing a Robust Monitoring System</title>
		<link>http://crunchtools.com/rubust-monitoring/</link>
		<comments>http://crunchtools.com/rubust-monitoring/#comments</comments>
		<pubDate>Sat, 08 Oct 2011 23:25:30 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Presentation]]></category>
		<category><![CDATA[log analysis]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://crunchtools.com/log-analysis-with-python-2/</guid>
		<description><![CDATA[This presentation was created for the Pittsburgh Perl Workshop 2011. It gives an overview of a fairly sophisticated set of criteria for building a large Nagios installation.]]></description>
			<content:encoded><![CDATA[<span id="Abstract"><h2>Abstract</h2></span>
<p>This presentation was created for the Pittsburgh Perl Workshop 2011. It gives an overview of a fairly sophisticated set of criteria for building a large Nagios installation.</p>
<p>&nbsp;</p>
<span id="Presentation"><h2>Presentation</h2></span>
<p><a href='http://crunchtools.com/wp-content/blogs.dir/23/files/2011/10/Robust-Monitoring.pdf'><img src="http://crunchtools.com/wp-content/blogs.dir/23/files/2011/10/robust_monitoring-300x218.png" alt="" title="robust_monitoring" width="300" height="218" class="alignnone size-medium wp-image-2314" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/rubust-monitoring/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Akron Linux User Group (ALUG) July 7th</title>
		<link>http://crunchtools.com/akron-linux-user-group-alug-july-7th/</link>
		<comments>http://crunchtools.com/akron-linux-user-group-alug-july-7th/#comments</comments>
		<pubDate>Thu, 02 Jun 2011 02:10:16 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Event]]></category>

		<guid isPermaLink="false">http://crunchtools.com/akron-linux-user-group-alug-july-7th/</guid>
		<description><![CDATA[            ]]></description>
			<content:encoded><![CDATA[<span id="Background"><h2>Background</h2></span>
<p>Terry Morris will be presenting a talk on advanced grep usage.</p>
<p>&nbsp;</p>
<span id="Schedule"><h2>Schedule</h2></span>
<p>New Era Restaurant, 10 Massillon Rd, Akron<br />
Room Opens 6:30pm, Dinner 7-8pm, presentation 8-9pm</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/akron-linux-user-group-alug-july-7th/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Akron Linux User Group (ALUG) June 2nd</title>
		<link>http://crunchtools.com/akron-linux-user-group-alug-june-2nd/</link>
		<comments>http://crunchtools.com/akron-linux-user-group-alug-june-2nd/#comments</comments>
		<pubDate>Thu, 02 Jun 2011 02:05:35 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Event]]></category>

		<guid isPermaLink="false">http://crunchtools.com/akron-linux-user-group-alug-june-2nd/</guid>
		<description><![CDATA[            ]]></description>
			<content:encoded><![CDATA[<span id="Background"><h2>Background</h2></span>
<p>Dave Egts will be presenting &#8220;User confinement with SELinux in Red Hat Enterprise Linux 6:<br />
Easily letting users get their job done, and that&#8217;s it&#8221;</p>
<p>&nbsp;</p>
<span id="Schedule"><h2>Schedule</h2></span>
<p>Room Opens 6:30pm, Dinner 7-8pm, presentation 8-9pm</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/akron-linux-user-group-alug-june-2nd/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Monitoring Data Structure Metrics</title>
		<link>http://crunchtools.com/monitoring-data-structure-metrics/</link>
		<comments>http://crunchtools.com/monitoring-data-structure-metrics/#comments</comments>
		<pubDate>Wed, 25 May 2011 05:16:19 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Article]]></category>

		<guid isPermaLink="false">http://crunchtools.com/?p=2199</guid>
		<description><![CDATA[I finished reading an article on <a href="http://highscalability.com">High Scalability</a> entitled <a href="http://highscalability.com/process/CreateJournalEntryComment?moduleId=4867632&#038;entryId=11418963">Troubleshooting Response Time Problems – Why You Cannot Trust Your System Metrics</a>, and it reminded me of why I developed a Cacti graphing plugin for monitoring <a href="http://crunchtools.com/software/crunchtools/cacti/graph-sockets-pipes-files/">sockets, pipes and files</a>.]]></description>
			<content:encoded><![CDATA[<p>I finished reading an article on <a href="http://highscalability.com">High Scalability</a> entitled <a href="http://highscalability.com/process/CreateJournalEntryComment?moduleId=4867632&#038;entryId=11418963">Troubleshooting Response Time Problems – Why You Cannot Trust Your System Metrics</a>, and it reminded me of why I developed a Cacti graphing plugin for monitoring <a href="http://crunchtools.com/software/crunchtools/cacti/graph-sockets-pipes-files/">sockets, pipes and files</a>.</p>
<p>This article is interesting because it basically demonstrates the same thing at the JVM level. Monitoring instantiated data structures and access to function/API calls is critical for guaranteeing the performance of your application. I think of it as runtime unit testing. It&#8217;s too bad we couldn&#8217;t climb up the stack a hair further and standardize on some essential data structures to monitor in all of the virtual machines: JVM, Ruby, Python, Perl, etc.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/monitoring-data-structure-metrics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cleveland Python Users Group (CLEPY) May 9th</title>
		<link>http://crunchtools.com/cleveland-python-users-group-clepy-may-9th/</link>
		<comments>http://crunchtools.com/cleveland-python-users-group-clepy-may-9th/#comments</comments>
		<pubDate>Sun, 08 May 2011 16:09:21 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Event]]></category>

		<guid isPermaLink="false">http://crunchtools.com/cleveland-python-users-group-clepy-may-9th/</guid>
		<description><![CDATA[            ]]></description>
			<content:encoded><![CDATA[<span id="Background"><h2>Background</h2></span>
<p>Jumpstart: how to launch a startup and get involved in entrepreneurship.</p>
<p><a href="http://clepy.org/events/17026236/?a=me1.2o_grp&#038;eventId=17026236&#038;action=detail&#038;rv=me1.2o&#038;rv=me1.2o">MeetUp Link</a></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/cleveland-python-users-group-clepy-may-9th/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Going to Red Hat</title>
		<link>http://crunchtools.com/going-to-red-hat/</link>
		<comments>http://crunchtools.com/going-to-red-hat/#comments</comments>
		<pubDate>Sun, 24 Apr 2011 23:06:03 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Article]]></category>

		<guid isPermaLink="false">http://crunchtools.com/going-to-red-hat/</guid>
		<description><![CDATA[Well, it&#8217;s official: I have accepted a position at Red Hat. I am excited because Red Hat is a company that I have wanted to work with since I started using Linux in 1998. For Red Hat, I will be a Solutions Architect for Enterprise Linux, also known as a technology evangelist. Now, it&#8217;s my job [...]]]></description>
			<content:encoded><![CDATA[<p>Well, it&#8217;s official: I have accepted a position at Red Hat. I am excited because Red Hat is a company that I have wanted to work with since I started using Linux in 1998. For Red Hat, I will be a Solutions Architect for Enterprise Linux, also known as a technology evangelist. Now, it&#8217;s my job to spend time with customers and explain the technology, which is something I already love doing.</p>
<p>My first experience with Linux was in 1998 on a Compaq 1135 laptop. I remember buying a Red Hat 5.2 box at Best Buy after a good friend of mine, Chad Remesch, gave me a load of grief because I didn&#8217;t know what Linux was. What a struggle it was to put Linux on that laptop.</p>
<p>I had to boot between Windows and Linux to get on the Internet because I had a winmodem. Eventually, I saved money to buy a Zoom PCMCIA modem, which had Linux drivers on a floppy. Once I compiled and installed the drivers, I was dangerous and on the Internet with Linux! I probably called Chad two times a week for six months, but I eventually got there.</p>
<p>Linux has come a long way since those days! Now I talk to people all the time who have used and installed Linux on laptops, desktops, and servers. Most of them did it by themselves with very little help.</p>
<p>I have also come a long way since then. In the last couple of years, I have had the opportunity to build a data center almost entirely on open source software, from Red Hat Linux to Bacula to LAMP.</p>
<p>Now, at Red Hat, we will be making inroads into many other data centers. I am looking forward to this new challenge of spreading open source software, a dream I have had for quite some time.</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/going-to-red-hat/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Do Rockstar Sysadmins Exist</title>
		<link>http://crunchtools.com/do-rockstar-sysadmins-exist/</link>
		<comments>http://crunchtools.com/do-rockstar-sysadmins-exist/#comments</comments>
		<pubDate>Sun, 24 Apr 2011 16:32:41 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Article]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Systems Administration]]></category>

		<guid isPermaLink="false">http://crunchtools.com/?p=2169</guid>
		<description><![CDATA[A couple of weeks ago, I heard the owner of our company talking on the phone to a client. In the conversation, he referred to me as a rockstar sysadmin. Thankfully, he wasn&#8217;t talking about my singing. I chuckled a bit, but didn&#8217;t think too much of it. I mean, it feels good to be [...]]]></description>
			<content:encoded><![CDATA[<p>A couple of weeks ago, I heard the owner of our company talking on the phone to a client. In the conversation, he referred to me as a <em><a href="http://www.pcworld.com/article/136688/calling_rock_star_sysadmins.html">rockstar sysadmin</a></em>. Thankfully, he wasn&#8217;t talking about my singing. I chuckled a bit, but didn&#8217;t think too much of it. I mean, it feels good to be called a rockstar, but really, it&#8217;s not like sysadmins have groupies throwing their underwear on stage (long story).</p>
<p>Then, a couple of days ago, we talked about the concept of a rockstar sysadmin. Is it possible for a sysadmin to stay a rockstar long term, or does this person usually go on to be a sales engineer, programmer, etc.? If it is possible, is it possible to be a rockstar Windows sysadmin, or do you have to be a jack of all trades?</p>
<p>What do you think? Leave a comment.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/do-rockstar-sysadmins-exist/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Designing a Robust Monitoring System</title>
		<link>http://crunchtools.com/designing-a-robust-monitoring-system/</link>
		<comments>http://crunchtools.com/designing-a-robust-monitoring-system/#comments</comments>
		<pubDate>Tue, 19 Apr 2011 15:35:03 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Article]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Featured]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Monitoring]]></category>
		<category><![CDATA[Systems Administration]]></category>

		<guid isPermaLink="false">http://crunchtools.com/?p=2140</guid>
		<description><![CDATA[Reading Ted Dziuba's article Monitoring Theory, I was reminded of several conventions that I have developed over the years to help with monitoring servers, network devices, software services, batch processes, etc. First, break down your data points into levels so that you can decide how to route them. Second, avoid interrupt-driven technology like email; it lowers your productivity and prohibits good analysis techniques.]]></description>
			<content:encoded><![CDATA[<span id="Background"><h2>Background</h2></span>
<p>Reading Ted Dziuba&#8217;s article <a href="http://teddziuba.com/2011/03/monitoring-theory.html">Monitoring Theory</a>, I was reminded of several conventions that I have developed over the years to help with monitoring servers, network devices, software services, batch processes, etc. First, break down your data points into levels so that you can decide how to route them. Second, avoid interrupt-driven technology like email; it lowers your productivity and prohibits good analysis techniques.</p>
<p>I like Ted&#8217;s breakdown into three main categories: &#8220;Neither Informative nor Actionable&#8221;, &#8220;Informative, but not Actionable&#8221;, and &#8220;Informative and Actionable&#8221;. I have made a few modifications to Ted&#8217;s breakdown, but his gist is spot on.</p>
<p>I also break things down into three categories: Logged, Non-Critical Action, and Critical Action. This allows me to quickly operationalize any data point that I can conceive. First, the data point should either go to a log or it should alert. Second, if it is indeed an alert, then it should be prioritized as non-critical (8-5) or critical (24&#215;7).</p>
<p>Personally, I also like to have the non-critical alerts go to a dashboard instead of paging; for critical alerts, you really have no choice but to page. There are things that need to be taken care of and there are things that can and should wait. If you make everything critical, you will burn your operations people out. Surprisingly, a lot of smart people get this wrong.</p>
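<p>As a rough illustration, the two routing questions above (does it alert, and if so can it wait for business hours) can be sketched in a few lines of Python. The destination names and function names here are hypothetical, chosen only for this example.</p>

```python
from enum import Enum


class Level(Enum):
    LOGGED = "logged"
    NON_CRITICAL = "non-critical action"
    CRITICAL = "critical action"


# Hypothetical destination names, used only for illustration.
DESTINATIONS = {
    Level.LOGGED: "central-log",
    Level.NON_CRITICAL: "dashboard",
    Level.CRITICAL: "pager",
}


def classify(needs_action, can_wait_for_business_hours):
    """Apply the two questions from the text, in order."""
    if not needs_action:
        return Level.LOGGED
    if can_wait_for_business_hours:
        return Level.NON_CRITICAL
    return Level.CRITICAL


def route(needs_action, can_wait_for_business_hours):
    """Return the destination for a data point."""
    return DESTINATIONS[classify(needs_action, can_wait_for_business_hours)]
```

<p>Keeping the decision to two yes/no questions makes it cheap to classify every new data point the same way.</p>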
<p>&nbsp;</p>
<span id="Logged"><h2>Logged</h2></span>
<p>This is anything that I really don&#8217;t need to know about unless there is some other problem. This is how a web server transaction works: if there is a problem you receive an error; otherwise, just log it or graph it. Also, I suggest staying away from email for this role. You usually end up pestering the operations people to the point that they filter the email to a folder and never look at it. It is better to use a centralized logging system of some kind and then do proper log analysis on it. I use <a href="http://crunchtools.com/petit">Petit</a> to write reports for <a href="http://crunchtools.com/centralizing-log-files/#Analysis">daily, weekly, and monthly analysis</a>. The reports give an approximation of reality; a folder in your email gives you nothing.</p>
<p>There are two main types of data here, numbers and letters. If it is some kind of numeric data that you are capturing, use a graph. If the data you are capturing is made up of discrete entries, use a log, then graph the log if it makes sense. Again, don&#8217;t email yourself the graphs; keep them in a central system. Something like Cacti is good.</p>
<p>Examples of logged data points include successful processes of all kinds:</p>
<ul>
<li><strong>Data Import/Exports</strong>: In many environments there are data import/export jobs which are critical to applications. That is fine, but you don&#8217;t want to pollute your email with successful job output; instead, put it into a log where it can be properly analyzed.</li>
<li><strong>Granular Job Tracking</strong>: When you put job tracking data into a log instead of email, you can start to take advantage of very granular job tracking. Instead of logging success/failure for the whole job, you can track individual parts of a job and really get some granularity in your system. This helps track down partial failures like slowdowns, etc.</li>
<li><strong>Load Average, CPU, Memory</strong>: Many people make the mistake of thinking these are good fault monitors. They are not; graph them instead. They will help tell you when a problem started after you receive a fault from some other check.</li>
<li><strong>BGP Route View Checks</strong>: In our environment, we check the ATT IX BGP looking glass every four hours to get a list of routes back to us through our two upstream providers. The Internet is dynamic, so sometimes the number of routes through each provider changes. That is fine, but I might want to have it logged if I have a problem later.</li>
<li><strong>Trace Routes</strong>: We log traceroutes between our data centers to have a historic view of what path things are taking. This can help when troubleshooting flaky VPN connections, but I don&#8217;t want to know about it unless the VPN gets flaky.</li>
<li><strong>Config File Generation</strong>: In many environments, configuration files are built and deployed. Many pieces of this process should be logged. It provides a paper trail for what has been deployed and may be used later if a bug is found in your build process. This will give you a starting point for repairing/rebuilding systems that were deployed during the time the <em>bug</em> was in the wild. It may also help track a start/stop time for when things occurred.</li>
<li><strong>Backup Processes</strong>: Parts of the backup process, for example, MySQL dumps, SVN dumps, and special DVD copies, can be orchestrated and logged from a central script. Your operations team won&#8217;t need to know this in an email or on a daily basis, but it is a blessing when trying to reverse engineer the system when there is a problem.</li>
</ul>
<p>Now, you shouldn&#8217;t forget that there is granularity here. If a particular piece of a process, check, import, etc. fails, you can escalate it to a non-critical or critical action. Sometimes it is even necessary to cause a cascading failure. For example, when collecting data points for a geographically redundant web application, if MySQL replication fails, you may also want to stop synchronization of a docroot until an operations person can investigate what went wrong.</p>
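<p>The granular job tracking described in the list above might look something like the following Python sketch, which logs the start, outcome, and duration of each part of a job instead of a single success/failure for the whole run. The logger name, the function name, and the key=value message layout are assumptions made for this example.</p>

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; point its handlers at your central log.
log = logging.getLogger("jobs")


@contextmanager
def tracked_step(job, step):
    """Log the start, outcome, and duration of one part of a job."""
    started = time.monotonic()
    log.info("job=%s step=%s status=start", job, step)
    try:
        yield
    except Exception:
        elapsed = time.monotonic() - started
        log.error("job=%s step=%s status=failed seconds=%.2f", job, step, elapsed)
        raise  # let the caller decide whether to escalate
    else:
        elapsed = time.monotonic() - started
        log.info("job=%s step=%s status=ok seconds=%.2f", job, step, elapsed)
```

<p>Wrapping each stage in its own <code>with tracked_step("import", "load"):</code> block gives one log entry per stage, so a slowdown in a single stage shows up in the durations even when the job as a whole succeeds.</p>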
<p>&nbsp;</p>
<span id="Non-Critical_Action"><h2>Non-Critical Action</h2></span>
<p>I prefer piping this kind of data point to a dashboard instead of paging. I expect our operations people to look at the dashboard first thing in the morning and periodically throughout the day. These kinds of data points do not need tending in the middle of the night.</p>
<p>You can be a bit more liberal with non-critical alerts, especially if they are fed to a dashboard, but don&#8217;t get carried away. You don&#8217;t want to pester your operations people so badly during the day that they can&#8217;t get project work done.</p>
<p>Examples of non-critical actions include failed processes of all kinds:</p>
<ul>
<li><strong>Software Vulnerabilities</strong>: These are published on public RSS feeds, but your operations team isn&#8217;t going to care about this at 2AM when they are sleeping. They can look at it during business hours. Even if it is a critical vulnerability, no one is going to look at this at 2AM. Many people get this wrong.</li>
<li><strong>SLA in Danger</strong>: These kinds of checks can be very difficult to tune. Four second page load times may be OK at 4AM, but not at 11AM. Sometimes time-based metrics are necessary, but this can require significant scripting or configuration in your monitoring system. I prefer embedding this in some kind of scripting framework because it gives you granular access by check. Often this is difficult to do in something like Nagios.</li>
<li><strong>Tape Cleaning</strong>: Good to know, but don&#8217;t wake me up.</li>
<li><strong>Captured Command Output</strong>: When scripting, use a utility to collect the non-zero return values and output of important commands. If something goes wrong, alert. This provides very granular reporting of failed portions of a script. I use something called scriptlog to pipe command return values and output to syslog. Then I have a Nagios check which captures failed commands and displays a warning in the dashboard. This allows the team to fix problems in the morning and also gives a good place to start when fixing scripting problems.</li>
</ul>
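<p>For the captured command output idea in the list above, a minimal Python sketch might look like this. It is a stand-in written for illustration, not the scriptlog utility itself; the function name and the message layout are assumptions, and it relies on the standard <code>syslog</code> module, which is only available on Unix-like systems.</p>

```python
import subprocess
import syslog


def run_logged(command, log=syslog.syslog):
    """Run a command and send its return value and output to syslog.

    Illustrative stand-in only; this is not the scriptlog utility.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    if result.returncode == 0:
        priority = syslog.LOG_INFO
    else:
        # A non-zero return value is what a dashboard check would look for.
        priority = syslog.LOG_WARNING
    log(priority, "cmd=%r rc=%d output=%r" % (command, result.returncode, output))
    return result.returncode
```

<p>A separate check can then scan the log for warning entries and raise a dashboard warning, which keeps the script and the alerting decision apart.</p>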
<p>&nbsp;</p>
<span id="Critical_Action"><h2>Critical Action</h2></span>
<p>Do not underestimate how difficult it is to determine these kinds of checks. It is really easy to put too many things in this category. If a switch, router, firewall, server, Apache, or the application is in a fault state, someone must be alerted and the service must be restored, but be careful when specifying an SLA for slow service. Make sure such SLAs are well defined and manageable. For example, many people will ping network or server gear once a minute and page when it is slow for 2 minutes. This is just impossible to manage; it will page you at 2AM every day. It is also completely useless: when you have a real problem, people will end up waiting for a recovery notice and won&#8217;t even start investigating until the problem <em>seems</em> real.</p>
<ul>
<li><strong>Network Down</strong>: It is preferable to monitor fault state with a true/false check of some kind, such as ping. Routers can slow down when updating BGP, and that does not necessarily indicate an actionable alert unless the slow state persists for long enough.</li>
<li><strong>Server Down</strong>: Server down state can be even trickier to determine. Often a server will stop reporting for a while because of heavy processing. If there is enough deviation in your systems, you may even need different checks for different groups of servers.</li>
<li><strong>Service Down</strong>: Services can be difficult to check. These checks can become very sophisticated, which makes them impractical for one minute intervals. In a blog, adding and deleting a post can be tested; in a shopping cart, an entire transaction can be tested. These checks can really stretch the boundary between testing and fault monitoring.</li>
<li><strong>SLA Not Met</strong>: As mentioned above, be very careful when setting these; they can and will drive your operations team insane. Use them sparingly and look for more deterministic ways of finding faults. Also make sure the SLA has been missed over some number of checks across some number of minutes before alerting, or you WILL have false positives.</li>
<li><strong>Open Sockets, Pipes, Files</strong>: Monitoring <a href="http://crunchtools.com/software/crunchtools/cacti/graph-sockets-pipes-files/">open sockets, pipes, and files</a> can be a very good indication that something is going to go wrong if you don&#8217;t look into it. If the threshold is high enough it can even indicate impending doom. In production, alerting on these numbers can give you five minutes to respond and fix before a meltdown.</li>
</ul>
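<p>To give a feel for the open sockets, pipes, and files check in the list above, here is a small Python sketch that counts them for one process. It is not the Cacti plugin linked above; it assumes a Linux system, where each entry in <code>/proc/PID/fd</code> is a symlink whose target looks like <code>socket:[1234]</code>, <code>pipe:[5678]</code>, or a file path. The function names are hypothetical.</p>

```python
import os
from collections import Counter


def classify_fd_target(target):
    """Classify a /proc/PID/fd symlink target (Linux-specific format)."""
    if target.startswith("socket:"):
        return "sockets"
    if target.startswith("pipe:"):
        return "pipes"
    # Everything else, including anon_inode entries, is counted as a file
    # in this simplified sketch.
    return "files"


def count_open_descriptors(pid="self"):
    """Count open sockets, pipes, and files for one process on Linux."""
    fd_dir = "/proc/%s/fd" % pid
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were looking
        counts[classify_fd_target(target)] += 1
    return counts
```

<p>Graphing these three counts over time, and alerting only once they cross a deliberately high threshold, fits the split between logged and critical data points described earlier.</p>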
<p>Do not limit your imagination when determining what can be checked, but be realistic when determining what it means if a check fails for one, five, or ten minutes. When determining an SLA for a sophisticated check, I suggest setting your targets long, especially during off hours. These numbers may sound wild, but I suggest you set the bar low and tune up. If you desensitize your operations people, your return to service times will be worse than these numbers anyway.</p>
<blockquote><p>Network Gear: 3-4 failed checks at one minute intervals<br />
Servers: 7-8 failed checks at one minute intervals<br />
Services: 7-8 failed checks at one minute intervals</p></blockquote>
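<p>One way to apply the numbers above is to page only after a run of failed checks rather than on the first failure. The Python sketch below assumes the failures must be consecutive and uses the low end of each quoted range; the class name and dictionary keys are hypothetical.</p>

```python
# Thresholds taken from the low end of the ranges quoted above;
# the dictionary keys are hypothetical labels.
FAILED_CHECKS_BEFORE_PAGE = {"network": 3, "server": 7, "service": 7}


class FaultTracker:
    """Page only after N consecutive failed checks."""

    def __init__(self, kind):
        self.threshold = FAILED_CHECKS_BEFORE_PAGE[kind]
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one check result; return True when it is time to page."""
        if check_passed:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Page exactly once, at the moment the threshold is reached.
        return self.consecutive_failures == self.threshold
```

<p>With checks at one minute intervals, a single slow or dropped check never pages anyone, and a passing check resets the count.</p>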
<p>&nbsp;</p>
<span id="Conclusion"><h2>Conclusion</h2></span>
<p>Obviously, this system could be expanded upon to include more granular time windows or 24&#215;7 network operation centers, but hopefully this gives you the gist of how to break things down. The goal is to provide your operations team with real actionable items when their time is the most expensive (off hours) while at the same time providing reasonable coverage of problems.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/designing-a-robust-monitoring-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Akron Linux User Group (ALUG) April 7th</title>
		<link>http://crunchtools.com/akron-linux-user-group-alug-april-7th/</link>
		<comments>http://crunchtools.com/akron-linux-user-group-alug-april-7th/#comments</comments>
		<pubDate>Sat, 09 Apr 2011 16:53:30 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Event]]></category>

		<guid isPermaLink="false">http://crunchtools.com/?p=2083</guid>
		<description><![CDATA[            ]]></description>
			<content:encoded><![CDATA[<span id="Background"><h2>Background</h2></span>
<p>Gaurav Saxena will be giving a talk on Perl.</p>
<p>&nbsp;</p>
<span id="Slides_038_Code"><h2>Slides &#038; Code</h2></span>
<p>Slides are <a href="https://docs.google.com/present/view?id=0AUtKibgU9JumZGMza3F2dzdfMTU1dHZ6M2QzemQ&#038;hl=en">here</a> (hosted on <a href="http://docs.google.com/">Google Docs</a>)</p>
<p>Code files are <a href="https://github.com/gsvolt/perl">here</a> (hosted on <a href="http://www.github.com/">GitHub</a>)</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://crunchtools.com/akron-linux-user-group-alug-april-7th/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

