Over the last few days here at the Circus we have been playing around with trying to test our service level agreements (SLAs). It came about because one of our off-campus sites was having connectivity issues and was extremely vocal in its complaints; squeaky wheel and all that.
The problem is that a vendor was blaming their poor performance on the site connectivity. Of course we set up Nagios to poll every minute, but that wasn’t good enough. We needed to be able to graph response time. Eventually I wrote a Perl script to feed data to MRTG, but before we did that we played around with the IOS rtr and ip sla commands.
I was working on something else when my counterpart began playing with ip sla and rtr, so I decided to lab it even though it is not on the ONT exam. Below is an image of the lab setup I am using. Once again it is the general cabling diagram from the ONT Lab book because I am not changing the wiring until I have to.
Goal for the lab:
- Have R1 download the index page from the web server 192.168.24.234 and report its statistics for a 24 hour period under the tag HTTP.234.
- Have R1 report the SLA for tcpConnect for a one hour period to R2.
Below are my answers to the lab.
The server at 192.168.24.234 is running a web server. Let’s test downloading a file from the server:
R1#copy http://192.168.24.234/index.html null:
Loading http://192.168.24.234/index.html
55 bytes copied in 0.060 secs (917 bytes/sec)
A short note about IP SLA and responders. Depending upon version number and platform you are able to do different operations. It is interesting that Cisco SLA monitoring is very careful regarding time stamps. This is so that you can truly get line speed as opposed to application or processing delays. From the Cisco documentation, IP SLA test packets use time stamping to minimize the processing delays. When the IP SLA responder is enabled, it allows the target device to take time stamps when the packet arrives on the interface at interrupt level and again just as it is leaving, eliminating the processing time. This time stamping is made with a granularity of sub-milliseconds.
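To make the time-stamping idea concrete, here is a minimal sketch of the arithmetic. It assumes four hypothetical timestamps in microseconds: t1 (probe leaves the source), t2 (probe arrives at the responder), t3 (reply leaves the responder) and t4 (reply arrives back at the source). The variable names and numbers are illustrative and are not taken from Cisco's documentation; the point is only that the responder's own processing time (t3 - t2) is subtracted from the raw round trip.

```shell
#!/bin/sh
# Hypothetical timestamps in microseconds; illustrative values only.
t1=1000    # probe leaves the source router
t2=1400    # probe arrives at the responder interface
t3=1650    # reply leaves the responder
t4=2100    # reply arrives back at the source

# network_rtt T1 T2 T3 T4
# Raw round trip minus the time the responder spent handling the probe.
network_rtt() {
    echo $(( ($4 - $1) - ($3 - $2) ))
}

echo "raw round trip:  $(( t4 - t1 )) us"
echo "responder delay: $(( t3 - t2 )) us"
echo "network RTT:     $(network_rtt "$t1" "$t2" "$t3" "$t4") us"
```

Each difference uses two timestamps taken on the same device, so in this sketch the two routers' clocks would not need to agree with each other.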
Now let’s see what operations sla supports on our router.
R1#sh ip sla monitor application
<omitted>
Supported Operation Types
Type of Operation to Perform: dhcp
Type of Operation to Perform: dns
Type of Operation to Perform: echo
Type of Operation to Perform: frameRelay
Type of Operation to Perform: ftp
Type of Operation to Perform: http
Type of Operation to Perform: jitter
Type of Operation to Perform: pathEcho
Type of Operation to Perform: pathJitter
Type of Operation to Perform: tcpConnect
Type of Operation to Perform: udpEcho
Type of Operation to Perform: voip
You might as well configure SNMP on the router. I used the ubiquitous public community string; I would recommend changing that:
snmp-server community public RO
Now to configure a test of the sla in the lab:
ip sla monitor 1
 type http operation get url http://192.168.24.234/index.html
 tag HTTP.234
ip sla monitor schedule 1 life 86400 start-time now
Notice that the schedule runs the operation for only one day (86,400 seconds), with a start-time of now. If you wanted to run this test indefinitely, you would configure life forever.
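The life value is just the test duration expressed in seconds. As a small sketch, a hypothetical helper (hours_to_life is my name, not an IOS or net-snmp command) can do the conversion and print the matching schedule line:

```shell
#!/bin/sh
# hours_to_life HOURS: convert a duration in hours to the seconds value used by "life".
# Hypothetical helper for illustration only.
hours_to_life() {
    echo $(( $1 * 60 * 60 ))
}

life=$(hours_to_life 24)
echo "ip sla monitor schedule 1 life $life start-time now"
```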
Now to show what is going on:
R1#sh ip sla monitor collection-statistics
Entry number: 1
Start Time Index: *15:43:14.400 UTC Sun Mar 31 2002
Number of successful operations: 5
Number of operations over threshold: 0
Number of failed operations due to a Disconnect: 0
Number of failed operations due to a Timeout: 0
Number of failed operations due to a Busy: 0
Number of failed operations due to a No Connection: 0
Number of failed operations due to an Internal Error: 0
Number of failed operations due to a Sequence Error: 0
Number of failed operations due to a Verify Error: 0
DNS RTT: 0
TCP Connection RTT: 57
HTTP Transaction RTT: 44
HTTP time to first byte: 86
DNS TimeOut: 0
TCP TimeOut: 0
Transaction TimeOut: 0
DNS Error: 0
TCP Error: 0
Transaction Error: 0
I also wanted to test the IP SLA tcpConnect operation. Here is the command to set up R2 as the responder:
ip sla monitor responder
And the commands to enable it on R1 as the source of the tcpConnect:
ip sla monitor 2
 type tcpConnect dest-ipaddr 192.168.12.2 dest-port 5000 source-ipaddr 192.168.12.1 source-port 5000
 timeout 1000
 frequency 10
ip sla monitor schedule 2 start-time now
And to confirm that it is working on R1:
R1#sh ip sla monitor collection-statistics 2
Entry number: 2
Start Time Index: *10:14:13.723 UTC Mon Apr 1 2002
Number of successful operations: 6
Number of operations over threshold: 0
Number of failed operations due to a Disconnect: 0
Number of failed operations due to a Timeout: 4
Number of failed operations due to a Busy: 0
Number of failed operations due to a No Connection: 0
Number of failed operations due to an Internal Error: 1
Number of failed operations due to a Sequence Error: 0
Number of failed operations due to a Verify Error: 0
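The counters for entry 2 can be boiled down to a single success percentage, which is closer to what an SLA report usually wants. This is only a sketch: success_rate is a hypothetical helper name, and the numbers are the ones shown in the R1 output for entry 2 (6 successes, 4 timeouts, 1 internal error).

```shell
#!/bin/sh
# success_rate OK FAILED: print the percentage of successful operations, one decimal place.
# Hypothetical helper for illustration only.
success_rate() {
    awk -v ok="$1" -v failed="$2" 'BEGIN {
        total = ok + failed
        if (total == 0) { print "n/a"; exit }   # nothing has run yet
        printf "%.1f\n", ok * 100 / total
    }'
}

# Counters copied from the collection-statistics output for entry 2.
ok=6
failed=$(( 4 + 1 ))   # 4 timeouts + 1 internal error
echo "tcpConnect success rate: $(success_rate "$ok" "$failed")%"
```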
Now to confirm that it is working on R2:
R2#sh ip sla monitor responder
IP SLA Monitor Responder is: Enabled
Number of control message received: 93 Number of errors: 0
Recent sources:
  192.168.12.1 [01:21:55.972 UTC Fri Mar 29 2002]
  192.168.12.1 [01:21:45.968 UTC Fri Mar 29 2002]
  192.168.12.1 [01:21:35.972 UTC Fri Mar 29 2002]
  192.168.12.1 [01:21:25.972 UTC Fri Mar 29 2002]
  192.168.12.1 [01:21:15.968 UTC Fri Mar 29 2002]
Recent error sources:
Back to the problem at hand. We were not getting good graphs from the data in our SLA configuration. The problem was that the MIB was not returning information that made graphable sense to MRTG, which is when I got involved to write a script that would help us out.
This is how you would download SNMP data from your router:
# snmpwalk -v 2c -c public 192.168.12.1 1.3.6.1.4.1.9.9.42.1.3.4.1.11.1
SNMPv2-SMI::enterprises.9.9.42.1.3.4.1.11.1.104057532 = Counter32: 329
And to make it more MRTG friendly:
# snmpwalk -v 2c -c public 192.168.12.1 1.3.6.1.4.1.9.9.42.1.3.4.1.11.1 | cut -d \: -f 4 | sed -e 's/ //g'
357
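To see what the cut and sed stages are doing without needing a router, the same pipeline can be run against a saved line of snmpwalk output. This is a sketch; snmp_value is a hypothetical function name, and the sample line is the one from the first snmpwalk above.

```shell
#!/bin/sh
# snmp_value: read snmpwalk output lines on stdin and print only the value.
# Field 4 (splitting on ":") is whatever follows "Counter32:"; sed then strips the spaces.
snmp_value() {
    cut -d : -f 4 | sed -e 's/ //g'
}

# Sample line copied from the snmpwalk output, so no router is needed.
sample='SNMPv2-SMI::enterprises.9.9.42.1.3.4.1.11.1.104057532 = Counter32: 329'
printf '%s\n' "$sample" | snmp_value
```

This relies on the value itself containing no colons or meaningful spaces, which holds for counters but would not hold for string values. The net-snmp output option -Oqv should also print the bare value directly and may be a simpler route where it is available.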
Regardless, I abandoned this when our graphs were not that helpful and moved on to another format. This script and resulting graph show the ping and http download speed to the web server in question. I realize there is a considerable amount of application latency built in, and the graphs also confirm this. Remember, Cisco sla takes great pains to eliminate the upper layer latency.
You can download this script in .tar or .pl. I have removed the perldoc formatting from the script below.
#!/usr/bin/perl
# 2009-10-13 Jud Bishop
# Please run perldoc on the script for more information.

use strict;
use Time::HiRes qw(gettimeofday);
use LWP::Simple;

my $server = "192.168.24.234";
my $page = "/Prod/site/default.aspx";

# Should not have to change anything below this.
my $download = "http://" . $server . $page;

# Record time prior to request
my $start = gettimeofday();

# Test for successful download
if (head($download)) {
    # Elapsed time, converted from seconds to milliseconds
    my $t = (gettimeofday() - $start) * 1000;
    printf ("%.4f \n", $t);
} else {
    print "0\n";
}

# Ping once and print the first value from the rtt summary line
system "ping -c 1 $server | grep rtt | cut -d \= -f 2 | cut -d \/ -f 1 | sed -e 's/ //g'";

print "Web Response\n";
print "Ping Response\n";

=head1 NAME

web-ping.pl - A script to download a web page and ping a server to compare response times.

=head1 SYNOPSIS

A script that outputs the time in ms to download a webpage and ping a server.

=head1 DESCRIPTION

This is for graphing both page download and ping response time for MRTG.

The external command must return 4 lines of output:
Line 1 current state of the first variable, normally 'incoming bytes count' but it represents the web page load time.
Line 2 current state of the second variable, normally 'outgoing bytes count' but it represents the ping time.
Line 3 string (in any human readable format), telling the uptime of the target, not used.
Line 4 string, telling the name of the target, not used.

Put this in your 192.168.1.1.cfg file. You may need to adjust the directories to match your configuration.

WorkDir: /usr/local/www/data-dist/stats/CircusStats2
Logformat: rrdtool
PathAdd: /usr/local/bin/
LibAdd: /usr/local/lib/perl5/site_perl/5.8.8/
Target[CircusStats-http]: `/usr/local/www/data/stats/configs/web-ping.pl`
Title[CircusStats-http]: Circus HTTP Response
PageTop[CircusStats-http]: Circus Response
LegendI[CircusStats-http]: HTTP Response
LegendO[CircusStats-http]: Ping Response
Ylegend[CircusStats-http]: Response in MS
Legend1[CircusStats-http]: HTTP Response
Legend2[CircusStats-http]: Ping Response
ShortLegend[CircusStats-http]: MS
routers.cgi*Options[CircusStats-http]: fixunit nototal nopercent nomax
routers.cgi*InCompact[CircusStats-http]: no
routers.cgi*Graph[CircusStats-http]: Circus-Combined noi

=head1 COPYRIGHT

Copyright 2009-10-13 Jud Bishop
Released under the GPLv2.

=cut
This is the resulting output from the script and MRTG configuration.
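For anyone adapting the script, the two pieces that matter to MRTG are the ping parsing and the four-line report. Here is a rough shell sketch of both, run against a made-up ping summary line so it needs no network access. The function names ping_rtt and mrtg_report are hypothetical, the sample line assumes the Linux iputils summary format, and 12.3456 is a placeholder for a measured web response time.

```shell
#!/bin/sh
# ping_rtt: read ping output on stdin and print the first value of the
# rtt summary line (the minimum, which equals the average for a single ping).
ping_rtt() {
    grep rtt | cut -d = -f 2 | cut -d / -f 1 | sed -e 's/ //g'
}

# mrtg_report WEB_MS PING_MS: print the four lines MRTG expects from an
# external command: two values, then two free-form strings.
mrtg_report() {
    printf '%s\n' "$1" "$2" "Web Response" "Ping Response"
}

# Hypothetical ping summary line; the values are made up.
sample='rtt min/avg/max/mdev = 0.045/0.045/0.045/0.000 ms'
ping_ms=$(printf '%s\n' "$sample" | ping_rtt)

# 12.3456 is a placeholder for the measured web response time in ms.
mrtg_report "12.3456" "$ping_ms"
```

On systems whose ping prints a summary line starting with "round-trip" rather than "rtt", the grep finds nothing and the second line comes out empty, so the pattern would need adjusting there.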
I used the IP SLA documentation to help me configure SLA; it is also the source of the quote above.