SLA and MRTG

Over the last few days here at the Circus we have been trying to test our service level agreements (SLAs). It came about because one of our off-campus sites was having connectivity issues and was extremely vocal in its complaints; squeaky wheel and all that.

The problem was that a vendor was blaming their poor performance on the site connectivity. Of course we set up Nagios to poll every minute, but that wasn’t good enough. We needed to be able to graph response time. Eventually I wrote a Perl script to feed data to MRTG, but before we did that we played around with the IOS rtr and ip sla commands.

I was working on something else when my counterpart began playing with ip sla and rtr, so I decided to lab it even though it is not on the ONT exam. Below is an image of the lab setup I am using. Once again it is the general cabling diagram from the ONT Lab book because I am not changing the wiring until I have to.

 

[Image: sla, the lab cabling diagram]

Goal for the lab:

  • Have R1 download the index page from the web server 192.168.24.234 and report its statistics for a 24 hour period under the tag HTTP.234.
  • Have R1 report the SLA for tcpConnect for a one hour period to R2.

Below are my answers to the lab.

The server at 192.168.24.234 is running a web server. Let’s test downloading a file from the server:

R1#copy http://192.168.24.234/index.html null:
Loading http://192.168.24.234/index.html
55 bytes copied in 0.060 secs (917 bytes/sec)

A short note about IP SLA and responders. Depending upon version number and platform you are able to do different operations. It is interesting that Cisco SLA monitoring is very careful regarding time stamps. This is so that you can truly get line speed as opposed to application or processing delays. From the Cisco documentation, IP SLA test packets use time stamping to minimize the processing delays. When the IP SLA responder is enabled, it allows the target device to take time stamps when the packet arrives on the interface at interrupt level and again just as it is leaving, eliminating the processing time. This time stamping is made with a granularity of sub-milliseconds.
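To make the time stamp idea concrete, here is a minimal Python sketch of the arithmetic that the description above implies. It is not Cisco's code. The function name and the four sample time stamps are hypothetical: the sender records when the packet leaves and when the reply returns, the responder records when the packet arrives and when it leaves again, and the responder's holding time is subtracted from the total.

```python
def network_rtt_ms(t1_sent, t2_target_in, t3_target_out, t4_received):
    """Round-trip time in ms with the responder's processing time removed.

    t1 and t4 come from the sender's clock, t2 and t3 from the responder's
    clock. Only differences taken on the same clock are used, so the two
    clocks do not have to agree with each other.
    """
    total = t4_received - t1_sent              # what the sender observes
    processing = t3_target_out - t2_target_in  # time spent inside the responder
    return total - processing


# Hypothetical time stamps in milliseconds (assumed values, not lab data).
print(network_rtt_ms(0.0, 1000.5, 1003.0, 5.5))
```

With these made-up numbers the sender sees 5.5 ms in total, the responder held the packet for 2.5 ms, and the remaining 3.0 ms is attributed to the network.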

Now let’s see what operations sla supports on our router.

R1#sh ip sla monitor application
<omitted>
Supported Operation Types
Type of Operation to Perform: dhcp
Type of Operation to Perform: dns
Type of Operation to Perform: echo
Type of Operation to Perform: frameRelay
Type of Operation to Perform: ftp
Type of Operation to Perform: http
Type of Operation to Perform: jitter
Type of Operation to Perform: pathEcho
Type of Operation to Perform: pathJitter
Type of Operation to Perform: tcpConnect
Type of Operation to Perform: udpEcho
Type of Operation to Perform: voip

You might as well configure SNMP on the router. I used the ubiquitous public community string, but I would recommend changing that:

snmp-server community public RO

Now to configure a test of the sla in the lab:

ip sla monitor 1
 type http operation get url http://192.168.24.234/index.html
 tag HTTP.234
ip sla monitor schedule 1 life 86400 start-time now

Notice that when we scheduled it we only asked it to run for a day: a life of 86,400 seconds with a start-time of now. If you wanted to run this test indefinitely you would configure life forever.

Now to show what is going on:

R1#sh ip sla monitor collection-statistics
Entry number: 1
Start Time Index: *15:43:14.400 UTC Sun Mar 31 2002
Number of successful operations: 5
Number of operations over threshold: 0
Number of failed operations due to a Disconnect: 0
Number of failed operations due to a Timeout: 0
Number of failed operations due to a Busy: 0
Number of failed operations due to a No Connection: 0
Number of failed operations due to an Internal Error: 0
Number of failed operations due to a Sequence Error: 0
Number of failed operations due to a Verify Error: 0
DNS RTT: 0
TCP Connection RTT: 57
HTTP Transaction RTT: 44
HTTP time to first byte: 86
DNS TimeOut: 0
TCP TimeOut: 0
Transaction TimeOut: 0
DNS Error: 0
TCP Error: 0
Transaction Error: 0

I also wanted to test the IP SLA tcpConnect operation. Here is the command to set up R2 as the responder:

ip sla monitor responder

And the commands to enable it on R1 as the source of the tcpConnect:

ip sla monitor 2
 type tcpConnect dest-ipaddr 192.168.12.2 dest-port 5000 source-ipaddr 192.168.12.1 source-port 5000
 timeout 1000
 frequency 10
ip sla monitor schedule 2 start-time now

No life is given in the schedule, so as far as I know the operation runs for the default life of 3600 seconds, which covers the one hour goal.

And to confirm that it works on R1:

R1#sh ip sla monitor collection-statistics 2
Entry number: 2
Start Time Index: *10:14:13.723 UTC Mon Apr 1 2002
Number of successful operations: 6
Number of operations over threshold: 0
Number of failed operations due to a Disconnect: 0
Number of failed operations due to a Timeout: 4
Number of failed operations due to a Busy: 0
Number of failed operations due to a No Connection: 0
Number of failed operations due to an Internal Error: 1
Number of failed operations due to a Sequence Error: 0
Number of failed operations due to a Verify Error: 0
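Six successes against four timeouts and one internal error is easier to judge as a percentage. As a rough sketch, assuming the output format shown above, a few lines of Python can turn that text into a success rate. The function names are hypothetical and the parsing has only been checked against this one sample.

```python
SAMPLE = """\
Entry number: 2
Start Time Index: *10:14:13.723 UTC Mon Apr 1 2002
Number of successful operations: 6
Number of operations over threshold: 0
Number of failed operations due to a Disconnect: 0
Number of failed operations due to a Timeout: 4
Number of failed operations due to a Busy: 0
Number of failed operations due to a No Connection: 0
Number of failed operations due to an Internal Error: 1
Number of failed operations due to a Sequence Error: 0
Number of failed operations due to a Verify Error: 0
"""


def parse_counters(text):
    """Collect every 'label: integer' line into a dict; skip other lines."""
    counters = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counters[label.strip()] = int(value.strip())
    return counters


def success_percent(counters):
    """Successful operations as a percentage of all attempted operations."""
    ok = counters.get("Number of successful operations", 0)
    failed = sum(count for label, count in counters.items()
                 if label.startswith("Number of failed operations"))
    attempts = ok + failed
    return 100.0 * ok / attempts if attempts else 0.0


counters = parse_counters(SAMPLE)
print(f"{success_percent(counters):.1f}% of operations succeeded")
```

For the sample above that is 6 out of 11 attempts, or about 54.5%.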

Now to confirm that it works on R2:

R2#sh ip sla monitor responder
IP SLA Monitor Responder is: Enabled
Number of control message received: 93 Number of errors: 0
Recent sources:
	192.168.12.1 [01:21:55.972 UTC Fri Mar 29 2002]
	192.168.12.1 [01:21:45.968 UTC Fri Mar 29 2002]
	192.168.12.1 [01:21:35.972 UTC Fri Mar 29 2002]
	192.168.12.1 [01:21:25.972 UTC Fri Mar 29 2002]
	192.168.12.1 [01:21:15.968 UTC Fri Mar 29 2002]
Recent error sources:

Back to the problem at hand. We were not getting good graphs from the data in our SLA configuration. The problem was that the MIB was not returning information that made graphable sense to MRTG, which is when I got involved to write a script that would help us out.

This is how you would download snmp data from your router:

# snmpwalk -v 2c -c public 192.168.12.1 1.3.6.1.4.1.9.9.42.1.3.4.1.11.1
SNMPv2-SMI::enterprises.9.9.42.1.3.4.1.11.1.104057532 = Counter32: 329

And to make it more MRTG friendly:

# snmpwalk -v 2c -c public 192.168.12.1 1.3.6.1.4.1.9.9.42.1.3.4.1.11.1 | cut -d \: -f 4 | sed -e 's/ //g'
357
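The same clean-up can be done in a script instead of a shell pipeline. The following Python sketch is only an illustration, and the function name is hypothetical. It takes one line of snmpwalk output in the format shown above and returns whatever follows the last colon as an integer. Note that it keys on the last colon rather than on the fourth field as the cut command does.

```python
def snmp_value(line):
    """Return the integer after the last ':' in one line of snmpwalk output."""
    _, _, value = line.rpartition(":")
    return int(value.strip())


# The sample line is copied from the snmpwalk output above.
line = "SNMPv2-SMI::enterprises.9.9.42.1.3.4.1.11.1.104057532 = Counter32: 329"
print(snmp_value(line))
```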

Regardless, I abandoned this when our graphs were not that helpful and moved on to another approach. The script and resulting graph below show the ping and HTTP response times for the web server in question. I realize there is a considerable amount of application latency built into these numbers, and the graphs confirm this. Remember, Cisco IP SLA takes great pains to eliminate the upper layer latency.

You can download this script in .tar or .pl. I have removed the perldoc formatting from the script below.

#!/usr/bin/perl
# 2009-10-13 Jud Bishop
# Please run perldoc on the script for more information.

use strict;
use Time::HiRes qw(gettimeofday);
use LWP::Simple;

my $server = "192.168.24.234";
my $page = "/Prod/site/default.aspx";

# Should not have to change anything below this.
my $download = "http://" . $server . $page;

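# Record time prior to request
my $start = gettimeofday();

# Test for a successful response. head() requests only the headers,
# so this times the server's response, not a full page download.
if (head($download))
{
    # gettimeofday() returns seconds in scalar context; convert to ms.
    my $t = (gettimeofday() - $start) * 1000;
    printf ("%.4f \n", $t);
}
else {
    print "0\n";
}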

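# Second line of output: the first value on ping's rtt summary line.
system "ping -c 1 $server | grep rtt | cut -d \= -f 2 | cut -d \/ -f 1 | sed -e 's/ //g'";

# Third and fourth lines: MRTG expects them but they are not used here.
print "Web Response\n";
print "Ping Response\n";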

=head1 NAME

web-ping.pl - A script to download a web page and ping a server to compare response times.

=head1 SYNOPSIS

A script that outputs the time in ms to download a webpage and ping a server.

=head1 DESCRIPTION

This is for graphing both page download and ping response time for MRTG.
The external command must return 4 lines of output:

Line 1 current state of the first variable, normally 'incoming bytes count' but it represents the web page load time.
Line 2 current state of the second variable, normally 'outgoing bytes count' but it represents the ping time.
Line 3 string (in any human readable format), telling the uptime of the target, not used.
Line 4 string, telling the name of the target, not used.

Put this in your 192.168.1.1.cfg file. You may need
to adjust the directories to match your configuration.

WorkDir: /usr/local/www/data-dist/stats/CircusStats2
Logformat: rrdtool
PathAdd: /usr/local/bin/
LibAdd: /usr/local/lib/perl5/site_perl/5.8.8/

Target[CircusStats-http]: `/usr/local/www/data/stats/configs/web-ping.pl`
Title[CircusStats-http]: Circus HTTP Response
PageTop[CircusStats-http]: Circus Response
LegendI[CircusStats-http]: HTTP Response
LegendO[CircusStats-http]: Ping Response
Ylegend[CircusStats-http]: Response in MS
Legend1[CircusStats-http]: HTTP Response
Legend2[CircusStats-http]: Ping Response
ShortLegend[CircusStats-http]: MS
routers.cgi*Options[CircusStats-http]: fixunit nototal nopercent nomax
routers.cgi*InCompact[CircusStats-http]: no
routers.cgi*Graph[CircusStats-http]: Circus-Combined noi

=head1 COPYRIGHT

Copyright 2009-10-13 Jud Bishop
Released under the GPLv2.
=cut
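For anyone who wants to see what the ping pipeline in the script extracts, and how the four lines handed to MRTG fit together, here is a small Python sketch. It is not part of web-ping.pl. The function names are hypothetical, and the sample ping output is an assumption modelled on the iputils summary line that the grep rtt in the script relies on. It looks for a line starting with rtt, which is slightly stricter than grep.

```python
# Assumed sample of "ping -c 1" output (iputils style), not captured lab data.
SAMPLE_PING = """\
PING 192.168.24.234 (192.168.24.234) 56(84) bytes of data.
64 bytes from 192.168.24.234: icmp_seq=1 ttl=64 time=0.321 ms

--- 192.168.24.234 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.321/0.321/0.321/0.000 ms
"""


def ping_min_rtt(ping_output):
    """Return the first value after '=' on the rtt summary line, as text."""
    for line in ping_output.splitlines():
        if line.startswith("rtt"):
            # " 0.321/0.321/0.321/0.000 ms" -> "0.321"
            return line.split("=")[1].split("/")[0].strip()
    return "0"  # no summary line, for example when the ping failed


def mrtg_lines(web_ms, ping_ms):
    """Build the four lines MRTG reads from an external command."""
    return [f"{web_ms:.4f}", str(ping_ms), "Web Response", "Ping Response"]


# 44.0 ms is a made-up web response time used only for the demonstration.
for line in mrtg_lines(44.0, ping_min_rtt(SAMPLE_PING)):
    print(line)
```

The first two lines carry the values MRTG graphs as "in" and "out"; the last two are the unused uptime and target name strings described in the perldoc above.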

This is the resulting graph from web-ping.pl and the MRTG configuration above.

[Image: sla-mrtg, the MRTG graph of HTTP and ping response times]

 

I used the IP SLA documentation to help me configure SLA; it is also the source of the quote above.
