Update 2010-08-26
I have made changes to the script below as a result of some requests. The output should be easier to read.
I should also point out how to find the categories. The usage message printed when you run the script with no command-line arguments is only an example; the script will search any category under the “for sale” heading of CraigsList. For instance, under “for sale” is the category “antiques”, and clicking on it leads to the link below.
http://atlanta.craigslist.org/atq/
The category is “atq” in the URL, and that is what you would pass to this script to search the “antiques” category. The same pattern applies if you would like to search “appliances” or any other category.
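If you would rather not read the code out of the URL by eye, the lookup can be sketched in a few lines of Perl. This is only an illustration: `category_from_url` is a hypothetical helper that is not part of CLCrawler3000.pl, and it assumes the category is the first path segment of the URL, as in the Atlanta link above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CLCrawler3000.pl): return the category
# code from a CraigsList category URL, or undef if the URL does not match.
# Assumes the category is the first path segment, as in the example above.
sub category_from_url {
    my ($url) = @_;
    return $url =~ m{^https?://[^/]+\.craigslist\.org/([a-z]+)/?}i ? $1 : undef;
}

print category_from_url('http://atlanta.craigslist.org/atq/'), "\n";   # prints "atq"
```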
Usage: ./CLCrawler3000.pl category keyword
Example: ./CLCrawler3000.pl sys "mac+mini"
Categories:
  sys == computers
  tls == tools
  bik == bike
  sad == system admin jobs
So if I wanted to look for a Linux system administration job I would type in:
./CLCrawler3000.pl sad linux
And if I wanted an armoire in the antiques category I would run the script with:
./CLCrawler3000.pl atq armoire
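For a sense of what those two arguments turn into, here is a rough sketch of the search URL built for each city. The URL pattern is the one the script uses; `search_url` itself is a hypothetical helper, and the keyword encoding is an addition of mine (the script passes the keyword through unchanged, which is why the usage example writes "mac+mini"). It assumes plain ASCII keywords, and a literal "+" typed into the keyword would be encoded as %2B here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the per-city search URL from the city
# subdomain, the category code and a keyword typed with ordinary spaces.
sub search_url {
    my ($city, $cat, $keyword) = @_;
    # Percent-encode everything except unreserved characters and spaces,
    # then turn spaces into "+" as in the "mac+mini" usage example.
    (my $q = $keyword) =~ s/([^A-Za-z0-9\-_.~ ])/sprintf('%%%02X', ord $1)/ge;
    $q =~ tr/ /+/;
    return "http://$city.craigslist.org/search/$cat?query=$q";
}

print search_url('atlanta', 'sys', 'mac mini'), "\n";
# prints "http://atlanta.craigslist.org/search/sys?query=mac+mini"
```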
Original Post
The name of this script was given by one of my work mates, Scott, when he started using it to search CraigsList. I wrote this script when I became frustrated with the functionality of CraigsList. I live in a small town and wanted to search for items on CraigsList; however, I had to search the larger cities around me in order to find the items I needed. It didn’t matter to me whether I went to Atlanta, Birmingham or Huntsville, since I was still going to have to drive, and when you are looking for bikes on CraigsList you might as well search all of Colorado, California and Texas. The script just grew from there.
I will say that CraigsList has changed its output format a couple of times since I wrote this script. I have also had to make changes depending upon the category I was searching. Like all scripts on the internet, your mileage may vary, but I hope you find this script as useful as I have.
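One way to soften the impact of those format changes is to keep the row matching in a single place. The sketch below is not how the script does it (the script simply keeps any line containing "href" and the city name); `parse_listing` is a hypothetical helper, and the pattern assumes the 2010 row format quoted in the script's comments, so it would need adjusting for any other layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the link and title out of one listing line.
# Returns a hash reference, or nothing if the line has no listing link.
# The pattern assumes the row format shown in the script's comments.
sub parse_listing {
    my ($line) = @_;
    return unless $line =~ m{<a\s+href="([^"]+)">\s*(.*?)\s*-?\s*</a>}i;
    return { url => $1, title => $2 };
}

my $row = 'Aug 26 - <a href="http://auburn.craigslist.org/bik/1920996795.html">Gary Fisher Mountain Bike -</a>';
my $listing = parse_listing($row);
print "$listing->{title} => $listing->{url}\n";
```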
I would also like to apologize for the code listing. I just used the simple code tag because more fancy highlighting did not look very good.
If you download the script and just run it from the command line, it will give you sample usage. When run with a category and keyword, it also writes a file, clcrawler3000.html, which you can open in your web browser to view the results.
Usage: ./CLCrawler3000.pl category keyword
Example: ./CLCrawler3000.pl sys "mac+mini"
Categories:
  sys == computers
  tls == tools
  bik == bike
  sad == system admin jobs
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;

die "Usage: $0 category keyword\n"
  . "Example: $0 sys \"mac+mini\"\n"
  . "Categories:\n"
  . "  sys == computers\n"
  . "  tls == tools\n"
  . "  bik == bike\n"
  . "  sad == system admin jobs\n"
  unless @ARGV;

# This is the category
my $cat = $ARGV[0] || "tls";

# This is the keyword you are looking for...
my $keyword = $ARGV[1] || "surface+plate";

# This is the output file.
my $html = "clcrawler3000.html";

# Define the arrays for each state to be passed into craigslist search,
# by defining each state individually I can tailor my searches quicker.
my %states = (
    Alabama => [ qw(auburn bham columbusga huntsville mobile montgomery tuscaloosa) ],
    Florida => [ qw(daytona keys fortlauderdale fortmyers gainesville jacksonville lakeland miami ocala orlando pensacola sarasota spacecoast tallahassee tampa treasure westpalmbeach) ],
    Georgia => [ qw(atlanta columbusga athensga augusta macon savannah valdosta) ],
    Mississippi => [ qw(gulfport hattiesburg jackson northmiss) ],
    Kentucky => [ qw(bgky cincinnati huntington lexington louisville westky) ],
    SouthCarolina => [ qw(charleston columbia greenville hiltonhead myrtlebeach) ],
    Tennessee => [ qw(memphis chattanooga knoxville nashville tricities) ],
    Alaska => [ qw(anchorage) ],
    Arizona => [ qw(flagstaff phoenix prescott tucson yuma) ],
    Arkansas => [ qw(fayar fortsmith jonesboro littlerock memphis texarkana) ],
    California => [ qw(bakersfield chico fresno goldcountry humboldt inlandempire losangeles merced modesto monterey orangecounty palmsprings redding reno sacramento sandiego sfbay slo santabarbara stockton ventura visalia) ],
    Colorado => [ qw(boulder cosprings denver fortcollins pueblo rockies westslope) ],
    Connecticut => [ qw(newlondon hartford newhaven nwct) ],
    Delaware => [ qw(delaware) ],
    DC => [ qw(washingtondc) ],
    Hawaii => [ qw(honolulu) ],
    Idaho => [ qw(boise eastidaho pullman spokane) ],
    Illinois => [ qw(bn carbondale chambana chicago peoria quadcities rockford springfield stlouis) ],
    Indiana => [ qw(bloomington evansville fortwayne indianapolis muncie southbend terrahaute tippecanoe chicago) ],
    Iowa => [ qw(ames cedarrapids desmoines dubuque iowacity omaha quadcities siouxcity) ],
    Kansas => [ qw(kansascity lawrence ksu topeka wichita) ],
    Louisiana => [ qw(batonrouge lafayette lakecharles neworleans shreveport) ],
    Maine => [ qw(maine) ],
    Maryland => [ qw(baltimore easternshore westmd) ],
    Massachusetts => [ qw(boston capecod southcoast westernmass worcester) ],
    Michigan => [ qw(annarbor centralmich detroit flint grandrapids jxn kalamazoo lansing nmi saginaw southbend up) ],
    Minnesota => [ qw(duluth fargo mankato minneapolis rmn stcloud) ],
    Missouri => [ qw(columbiamo joplin kansascity springfield stlouis) ],
    Montana => [ qw(montana) ],
    Nebraska => [ qw(grandisland lincoln omaha siouxcity) ],
    Nevada => [ qw(lasvegas reno) ],
    NewHampshire => [ qw(nh) ],
    NewJersey => [ qw(cnj newjersey southjersey) ],
    NewMexico => [ qw(albuquerque lascruces roswell santafe) ],
    NewYork => [ qw(albany binghamton buffalo catskills chautauqua elmira hudsonvalley ithaca longisland newyork plattsburgh rochester syracuse utica watertown) ],
    NorthCarolina => [ qw(asheville boone charlotte eastnc fayetteville greensboro outerbanks raleigh wilmington winstonsalem) ],
    NorthDakota => [ qw(fargo nd) ],
    Ohio => [ qw(akroncanton athensohio cincinnati cleveland columbus dayton huntington limaohio mansfield parkersburg toledo wheeling youngstown) ],
    Oklahoma => [ qw(fortsmith lawton oklahomacity stillwater tulsa) ],
    Oregon => [ qw(bend corvallis eastoregon eugene medford oregoncoast portland salem) ],
    Pennsylvania => [ qw(altoona erie harrisburg lancaster allentown philadelphia pittsburgh poconos reading scranton pennstate york) ],
    RhodeIsland => [ qw(providence) ],
    SouthDakota => [ qw(sd) ],
    Texas => [ qw(dallas houston sanantonio austin beaumont brownsville) ],
    Utah => [ qw(logan ogden provo saltlakecity stgeorge) ],
    Vermont => [ qw(burlington) ],
    Virginia => [ qw(blacksburg charlottesville danville norfolk harrisonburg lynchburg richmond roanoke) ],
    Washington => [ qw(bellingham kpr pullman seattle spokane wenatchee yakima) ],
    WestVirginia => [ qw(charlestonwv huntington martinsburg morgantown parkersburg wheeling) ],
    Wisconsin => [ qw(appleton duluth eauclaire greenbay lacrosse madison milwaukee) ],
    Wyoming => [ qw(wyoming) ],
);

sub get_craigs {
    my $city = shift;
    my $url  = "http://$city.craigslist.org/search/$cat?query=$keyword";

    print "city == $city\n";
    print "keyword == $keyword\n";
    print "category == $cat\n";
    print "$url\n";

    # Download the page using get(). It returns undef on failure, so skip
    # this city with a warning instead of aborting the whole crawl.
    my $content = get($url);
    unless (defined $content) {
        warn "Could not fetch $url\n";
        return;
    }

    # Split up the page blob into lines so that we can manipulate them.
    my @lines = split(/\n/, $content);

    foreach my $i (0 .. $#lines) {
        # This is the key to the whole program, the returned listings are in rows
        # This is the item listing.
        # I tested this on bikes.
        # <p class="row">
        # <span class="ih" id="images:3n63o53l45O25V35W4a8q669e2752037a111f.jpg"> </span>
        # Aug 26 - <a href="http://auburn.craigslist.org/bik/1920996795.html">Gary Fisher Mountain Bike -</a>
        # $950<font size="-1"> (Auburn, AL)</font> <span class="p"> pic</span><br class="c">
        # </p>
        if (($lines[$i] =~ /href/) && ($lines[$i] =~ /$city/)) {
            print "line == $lines[$i]\n";
            my $line = $lines[$i];
            print HTML "$line<br>\n";
        }
    }
}

#------------------------------------------------------------------------------
# This didn't really have to be a subroutine, just cleaning things up and making
# them modular. Open the file.
#------------------------------------------------------------------------------
sub open_html_file {
    open(HTML, '>', $html) or die "Error: can't open $html \n $!";
}

#------------------------------------------------------------------------------
# Close the file.
#------------------------------------------------------------------------------
sub close_html_file {
    close HTML or die "Error: can't close $html\n $!";
}

#------------------------------------------------------------------------------
# Main.
#------------------------------------------------------------------------------
open_html_file();

# Make the html header
print HTML "<html>\n <head>\n <title>CraigsList Crawler 3000</title>\n </head>\n <body>\n <br>\n\n";

# Iterate through the hash of arrays
foreach my $key ( keys %states ) {
    print HTML "<br>$key<br>\n";
    foreach my $i ( 0 .. $#{ $states{$key} } ) {
        print HTML "<br>$states{$key}[$i]<br>\n";
        get_craigs($states{$key}[$i]);
        sleep(5);
    }
    print "\n";
}

print HTML " </body>\n</html>\n";
close_html_file();
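Some cities sit in more than one state list (columbusga is under both Alabama and Georgia, for example), so the main loop fetches them more than once. If that bothers you, something along these lines would flatten the hash into a list of unique cities first. It is only a sketch: `unique_cities` is a hypothetical helper that is not in the script, the small hash is a cut-down stand-in for the real one, and using it means giving up the per-state headings in the output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Cut-down stand-in for the script's %states hash.
my %states = (
    Alabama => [ qw(auburn bham columbusga) ],
    Georgia => [ qw(atlanta columbusga athensga) ],
);

# Hypothetical helper: return each city once, walking the states in
# alphabetical order and keeping the first occurrence of every city.
sub unique_cities {
    my ($states) = @_;
    my (%seen, @cities);
    for my $state (sort keys %$states) {
        for my $city (@{ $states->{$state} }) {
            push @cities, $city unless $seen{$city}++;
        }
    }
    return @cities;
}

print join(' ', unique_cities(\%states)), "\n";
# prints "auburn bham columbusga atlanta athensga"
```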