CraigsList Crawler 3000

Update 2010-08-26
I have made changes to the script below as a result of some requests. The output should be easier to read.

I should also point out how to find the categories. The usage message printed when you run the script with no command-line arguments is only an example; the script will search any category under the “for sale” heading of CraigsList. For instance, “for sale” contains the category “antiques”, and clicking on it leads to the link below.

http://atlanta.craigslist.org/atq/

The category code in the URL is “atq”, and that is what you pass to the script to search the “antiques” category. The same approach works for “appliances” or any other category.
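
If you would rather not pick the code out by eye, the lookup can be sketched in a few lines of Perl. This is only an illustration: `category_from_url` is a hypothetical helper that is not part of the script, and it assumes the category page URL has the same shape as the Atlanta example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CLCrawler3000.pl): pull the category
# code out of a CraigsList category URL such as
# http://atlanta.craigslist.org/atq/
sub category_from_url {
    my ($url) = @_;
    # Capture the first path segment after the craigslist.org host name.
    return $url =~ m{^https?://[^/]+\.craigslist\.org/([a-z]+)/?}i ? $1 : undef;
}

print category_from_url("http://atlanta.craigslist.org/atq/"), "\n";   # prints "atq"
```

The value it prints is what goes in the first argument of the script.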

 Usage: ./CLCrawler3000.pl category keyword
 Example: ./CLCrawler3000.pl sys "mac+mini"
 Categories: 
 sys == computers
 tls == tools
 bik == bike
 sad == system admin jobs

So if I wanted to look for a Linux system administration job, I would type in:

 ./CLCrawler3000.pl sad linux

And if I wanted an armoire in the antiques category, I would run the script with:

 ./CLCrawler3000.pl atq armoire
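
For anyone curious how those two arguments are used, here is a minimal sketch of the search URL the script requests for a single city. The URL pattern is the one in the script; `keyword_for_query` and `search_url` are hypothetical helper names I made up for this example, and joining words with “+” is just the convention from the "mac+mini" usage line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a plain phrase into the "+"-joined form
# used in the usage example, e.g. "mac mini" becomes "mac+mini".
sub keyword_for_query {
    my ($phrase) = @_;
    return join "+", split " ", $phrase;
}

# Build the same search URL the script requests for one city.
sub search_url {
    my ($city, $cat, $keyword) = @_;
    return "http://$city.craigslist.org/search/$cat?query=$keyword";
}

# prints "http://atlanta.craigslist.org/search/sys?query=mac+mini"
print search_url("atlanta", "sys", keyword_for_query("mac mini")), "\n";
```

The script repeats this request for every city in its list, which is why only the category and keyword are needed on the command line.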

Original Post
The name of this script was given by one of my work mates, Scott, when he started using it to search CraigsList. I wrote this script when I became frustrated with the functionality of CraigsList. I live in a small town and wanted to search for items on CraigsList; however, I had to search each of the larger cities around me to find the items I needed. It didn’t matter to me whether I went to Atlanta, Birmingham or Huntsville, since I was still going to have to drive, and when you are looking for bikes on CraigsList you might as well search all of Colorado, California and Texas. The script just grew from there.

I will say that CraigsList has changed its output format a couple of times since I wrote this script. I have also had to make changes depending on the category I was searching. Like all scripts on the internet, your mileage may vary, but I hope you find this script as useful as I have.

I would also like to apologize for the code listing. I just used the simple code tag because fancier highlighting did not look very good.

If you download the script and run it from the command line with no arguments, it prints sample usage. When run with a category and keyword, it also writes a file, clcrawler3000.html, which you can open in your web browser to view the results.

 Usage: ./CLCrawler3000.pl category keyword
 Example: ./CLCrawler3000.pl sys "mac+mini"
 Categories:
 sys == computers
 tls == tools
 bik == bike
 sad == system admin jobs

#!/usr/bin/perl

use strict;
use LWP::Simple;
use HTML::TokeParser;

die " Usage: $0 category keyword\n Example: $0 sys \"mac+mini\" \n Categories: \n sys == computers\n tls == tools\n bik == bike\n sad == system admin jobs\n " unless @ARGV;

# This is the category
my $cat = $ARGV[0] || "tls";

# This is the keyword you are looking for...
my $keyword =  $ARGV[1] || "surface+plate";

# This is the output file.
my $html = "clcrawler3000.html";

# Define the arrays for each state to be passed into craigslist search,
# by defining each state individually I can tailor my searches quicker.
my %states = (
	Alabama => [ qw(auburn bham columbusga huntsville mobile montgomery tuscaloosa) ],
	Florida => [ qw(daytona keys fortlauderdale fortmyers gainesville jacksonville lakeland miami ocala orlando pensacola sarasota spacecoast tallahassee tampa treasure westpalmbeach) ],
	Georgia => [ qw(atlanta columbusga athensga augusta macon savannah valdosta) ],
	Mississippi => [ qw(gulfport hattiesburg jackson northmiss) ],
	Kentucky => [ qw(bgky cincinnati huntington lexington louisville westky) ],
	SouthCarolina => [ qw(charleston columbia greenville hiltonhead myrtlebeach) ],
	Tennessee => [ qw(memphis chattanooga knoxville nashville tricities) ],
	Alaska => [ qw(anchorage) ],
	Arizona => [ qw(flagstaff phoenix prescott tucson yuma) ],
	Arkansas => [ qw(fayar fortsmith jonesboro littlerock memphis texarkana) ],
	California => [ qw(bakersfield chico fresno goldcountry humboldt inlandempire losangeles merced modesto monterey orangecounty palmsprings redding reno sacramento sandiego sfbay slo santabarbara stockton ventura visalia) ],
	Colorado => [ qw(boulder cosprings denver fortcollins pueblo rockies westslope)],
	Connecticut => [ qw(newlondon hartford newhaven nwct) ],
	Delaware => [ qw(delaware) ],
	DC => [ qw(washingtondc) ],
	Hawaii => [ qw(honolulu) ],
	Idaho => [ qw(boise eastidaho pullman spokane) ],
	Illinois => [ qw(bn carbondale chambana chicago peoria quadcities rockford springfield stlouis) ],
	Indiana => [ qw(bloomington evansville fortwayne indianapolis muncie southbend terrahaute tippecanoe chicago) ],
	Iowa => [ qw(ames cedarrapids desmoines dubuque iowacity omaha quadcities siouxcity) ],
	Kansas => [ qw(kansascity lawrence ksu topeka wichita) ],
	Louisiana => [ qw(batonrouge lafayette lakecharles neworleans shreveport) ],
	Maine => [ qw(maine) ],
	Maryland => [ qw(baltimore easternshore westmd) ],
	Massachusetts => [ qw(boston capecod southcoast westernmass worcester) ],
	Michigan => [ qw(annarbor centralmich detroit flint grandrapids jxn kalamazoo lansing nmi saginaw southbend up) ],
	Minnesota => [ qw(duluth fargo mankato minneapolis rmn stcloud) ],
	Missouri => [ qw(columbiamo joplin kansascity springfield stlouis) ],
	Montana => [ qw(montana) ],
	Nebraska => [ qw(grandisland lincoln omaha siouxcity) ],
	Nevada => [ qw(lasvegas reno)],
	NewHampshire => [ qw(nh) ],
	NewJersey => [ qw(cnj newjersey southjersey) ],
	NewMexico => [ qw(albuquerque lascruces roswell santafe) ],
	NewYork => [ qw(albany binghamton buffalo catskills chautauqua elmira hudsonvalley ithaca longisland newyork plattsburgh rochester syracuse utica watertown) ],
	NorthCarolina => [ qw(asheville boone charlotte eastnc fayetteville greensboro outerbanks raleigh wilmington winstonsalem) ],
	NorthDakota => [ qw(fargo nd) ],
	Ohio => [ qw(akroncanton athensohio cincinnati cleveland columbus dayton huntington limaohio mansfield parkersburg toledo wheeling youngstown) ],
	Oklahoma => [ qw(fortsmith lawton oklahomacity stillwater tulsa) ],
	Oregon => [ qw(bend corvallis eastoregon eugene medford oregoncoast portland salem) ],
	Pennsylvania => [ qw(altoona erie harrisburg lancaster allentown philadelphia pittsburgh poconos reading scranton pennstate york) ],
	RhodeIsland => [ qw(providence) ],
	SouthDakota => [ qw(sd) ],
	Texas => [ qw(dallas houston sanantonio austin beaumont brownsville) ],
	Utah => [ qw(logan ogden provo saltlakecity stgeorge) ],
	Vermont => [ qw(burlington) ],
	Virginia => [ qw(blacksburg charlottesville danville norfolk harrisonburg lynchburg richmond roanoke) ],
	Washington => [ qw(bellingham kpr pullman seattle spokane wenatchee yakima) ],
	WestVirginia => [ qw(charlestonwv huntington martinsburg morgantown parkersburg wheeling) ],
	Wisconsin => [ qw(appleton duluth eauclaire greenbay lacrosse madison milwaukee) ],
	Wyoming => [ qw(wyoming) ],
);

sub get_craigs {

	my $city = shift;

	# Show progress on the terminal, then download the results page with get().
	print "city == $city\n";
	print "keyword == $keyword\n";
	print "category == $cat\n";
	print "http://$city.craigslist.org/search/$cat?query=$keyword \n";

	# get() returns undef on failure and does not reliably set $!, so skip
	# this city with a warning rather than abort the whole crawl.
	my $content = get( "http://$city.craigslist.org/search/$cat?query=$keyword" );
	unless ( defined $content ) {
		warn "Could not fetch results for $city\n";
		return;
	}

	# Split up the page blob into lines so that we can manipulate them.
	my @lines = split(/\n/, $content);

	foreach my $line (@lines)
	{
		# This is the key to the whole program: each returned listing is a
		# row of HTML like the sample below (I tested this on bikes).
#                <p class="row">
#                        <span class="ih" id="images:3n63o53l45O25V35W4a8q669e2752037a111f.jpg">&nbsp;</span>
#                         Aug 26 - <a href="http://auburn.craigslist.org/bik/1920996795.html">Gary Fisher Mountain Bike  -</a>
#                         $950<font size="-1"> (Auburn, AL)</font> <span class="p"> pic</span><br class="c">
#                </p>
		# Keep only the lines that carry a link back to this city's site.
		if (($line =~ /href/) && ($line =~ /$city/))
		{
			print "line == $line\n";
			print HTML "$line<br>\n";
		}
	}


}

#------------------------------------------------------------------------------
# This didn't really have to be a subroutine, just cleaning things up and making
# them modular.  Open the file.
#------------------------------------------------------------------------------
sub open_html_file {
        open (HTML, '>', $html)
        or die "Error: can't open $html \n $!";
}

#------------------------------------------------------------------------------
# Close the file.
#------------------------------------------------------------------------------
sub close_html_file {
        close HTML or die "Error: can't close $html\n $!";
}


#------------------------------------------------------------------------------
# Main.
#------------------------------------------------------------------------------

open_html_file();

# Write the HTML header
print HTML "<html>\n <head>\n <title>CraigsList Crawler 3000</title>\n </head>\n <body>\n <br>\n\n" ;

# Iterate through the hash of arrays
foreach my $key ( sort keys %states )
{
	print HTML "<br>$key<br>\n";
	foreach my $i ( 0 .. $#{ $states{$key} } ) 
	{
		print HTML "<br>$states{$key}[$i]<br>\n";
		get_craigs($states{$key}[$i]);
		sleep(5);
	}
        print "\n";
}


print HTML " </body>\n</html>\n" ;
close_html_file();
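
One thing to be aware of: a few cities are listed under more than one state (columbusga appears under both Alabama and Georgia, and memphis under both Tennessee and Arkansas), so the main loop requests them more than once. If you only care about a handful of states, something like the sketch below could build a list with each city in it once. This is only an illustration under my own assumptions: `unique_cities` is a hypothetical helper that is not in the script, and the hash here is a trimmed copy of `%states`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A trimmed copy of the script's %states hash, for illustration only.
my %states = (
    Alabama   => [ qw(auburn bham columbusga huntsville) ],
    Georgia   => [ qw(atlanta columbusga athensga) ],
    Tennessee => [ qw(memphis chattanooga) ],
    Arkansas  => [ qw(littlerock memphis) ],
);

# Hypothetical helper: return each city once for the requested states,
# keeping the order in which the states were named.
sub unique_cities {
    my ($states_ref, @wanted) = @_;
    my (%seen, @cities);
    for my $state (@wanted) {
        next unless exists $states_ref->{$state};
        # grep keeps a city only the first time it is seen.
        push @cities, grep { !$seen{$_}++ } @{ $states_ref->{$state} };
    }
    return @cities;
}

# prints "auburn bham columbusga huntsville atlanta athensga"
print join(" ", unique_cities(\%states, qw(Alabama Georgia))), "\n";
```

Each city in the returned list could then be passed to get_craigs() in place of the nested loop, with the same sleep(5) between requests.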
