
CraigsList Crawler 3000

September 1st, 2009

Update 2010-08-26
I have made changes to the script below as a result of some requests. The output should be easier to read.

I should also point out how to find the categories. The usage example that is printed when you run the script with no command-line switches is only an example; the script will search any category under the “for sale” heading of CraigsList. For instance, under “for sale” is the category “antiques”, and when I click on it the link is below.

http://atlanta.craigslist.org/atq/

The category is “atq” in the URL and that is what you would put to search the “antiques” category with this script. The same construct applies if you would like to search “appliances” or any other category.
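If you want to check which piece of the URL is the category code, a short regular expression will pull it out. Below is a minimal sketch (not part of CLCrawler3000.pl, just an illustration that assumes the URL shape shown above):

#!/usr/bin/perl
# Minimal sketch: extract the category code from a CraigsList category URL.
# Assumes the shape http://<city>.craigslist.org/<category>/
use strict;
use warnings;

my $url = "http://atlanta.craigslist.org/atq/";
if ( $url =~ m{^https?://[^/]+\.craigslist\.org/([^/]+)/?}i ) {
    print "category code: $1\n";    # prints "atq"
}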

 Usage: ./CLCrawler3000.pl category keyword
 Example: ./CLCrawler3000.pl sys "mac+mini"
 Categories:
 sys == computers
 tls == tools
 bik == bike
 sad == system admin jobs

So if I wanted to look for a Linux system administration job, I would type:

 ./CLCrawler3000.pl sad linux

And if I wanted an armoire in the antiques category, I would run the script with:

 ./CLCrawler3000.pl atq armoire

Original Post
This script got its name from one of my workmates, Scott, when he started using it to search CraigsList. I wrote it when I became frustrated with the functionality of CraigsList. I live in a small town, so to find the items I needed I had to search the larger cities around me. It didn’t matter whether I went to Atlanta, Birmingham, or Huntsville; I was still going to have to drive, and when you are looking for bikes on CraigsList you might as well search all of Colorado, California, and Texas. The script just grew from there.

I will say that CraigsList has changed its output format a couple of times since I wrote this script. I have also had to make changes depending on the category I was searching. Like all scripts on the internet, your mileage may vary, but I hope you find this script as useful as I have.

I would also like to apologize for the code listing. I just used the simple code tag because fancier highlighting did not look very good.

If you download the script and just run it from the command line, it will give you sample usage. It also outputs a file, clcrawler3000.html, which you can open in your web browser to view the results.

 Usage: ./CLCrawler3000.pl category keyword
 Example: ./CLCrawler3000.pl sys "mac+mini"
 Categories:
 sys == computers
 tls == tools
 bik == bike
 sad == system admin jobs
#!/usr/bin/perl

use strict;
use LWP::Simple;
use HTML::TokeParser;

die " Usage: $0 category keyword\n Exmaple: $0 sys \"mac+mini\" \n Categories: \n sys == computers\n tls == tools\n bik == bike\n sad == system admin jobs\n " unless @ARGV;

# This is the category
my $cat = $ARGV[0] || "tls";

# This is the keyword you are looking for...
my $keyword =  $ARGV[1] || "surface+plate";

# This is the output file.
my $html = "clcrawler3000.html";

# Define the arrays for each state to be passed into craigslist search,
# by defining each state individually I can tailor my searches more quickly.
my %states = (
    Alabama => [ qw(auburn bham columbusga huntsville mobile montgomery tuscaloosa) ],
    Florida => [ qw(daytona keys fortlauderdale fortmyers gainesville jacksonville lakeland miami ocala orlando pensacola sarasota spacecoast tallahassee tampa treasure westpalmbeach) ],
    Georgia => [ qw(atlanta columbusga athensga augusta macon savannah valdosta) ],
    Mississippi => [ qw(gulfport hattiesburg jackson northmiss) ],
    Kentucky => [ qw(bgky cincinnati huntington lexington louisville westky) ],
    SouthCarolina => [ qw(charleston columbia greenville hiltonhead myrtlebeach) ],
    Tennessee => [ qw(memphis chattanooga knoxville nashville tricities) ],
    Alaska => [ qw(anchorage) ],
    Arizona => [ qw(flagstaff phoenix prescott tucson yuma) ],
    Arkansas => [ qw(fayar fortsmith jonesboro littlerock memphis texarkana) ],
    California => [ qw(bakersfield chico fresno goldcountry humboldt inlandempire losangeles merced modesto monterey orangecounty palmsprings redding reno sacramento sandiego sfbay slo santabarbara stockton ventura visalia) ],
    Colorado => [ qw(boulder cosprings denver fortcollins pueblo rockies westslope)],
    Connecticut => [ qw(newlondon hartford newhaven nwct) ],
    Delaware => [ qw(delaware) ],
    DC => [ qw(washingtondc) ],
    Hawaii => [ qw(honolulu) ],
    Idaho => [ qw(boise eastidaho pullman spokane) ],
    Illinois => [ qw(bn carbondale chambana chicago peoria quadcities rockford springfield stlouis) ],
    Indiana => [ qw(bloomington evansville fortwayne indianapolis muncie southbend terrehaute tippecanoe chicago) ],
    Iowa => [ qw(ames cedarrapids desmoines dubuque iowacity omaha quadcities siouxcity) ],
    Kansas => [ qw(kansascity lawrence ksu topeka wichita) ],
    Louisiana => [ qw(batonrouge lafayette lakecharles neworleans shreveport) ],
    Maine => [ qw(maine) ],
    Maryland => [ qw(baltimore easternshore westmd) ],
    Massachusetts => [ qw(boston capecod southcoast westernmass worcester) ],
    Michigan => [ qw(annarbor centralmich detroit flint grandrapids jxn kalamazoo lansing nmi saginaw southbend up) ],
    Minnesota => [ qw(duluth fargo mankato minneapolis rmn stcloud) ],
    Missouri => [ qw(columbiamo joplin kansascity springfield stlouis) ],
    Montana => [ qw(montana) ],
    Nebraska => [ qw(grandisland lincoln omaha siouxcity) ],
    Nevada => [ qw(lasvegas reno)],
    NewHampshire => [ qw(nh) ],
    NewJersey => [ qw(cnj newjersey southjersey) ],
    NewMexico => [ qw(albuquerque lascruces roswell santafe) ],
    NewYork => [ qw(albany binghamton buffalo catskills chautauqua elmira hudsonvalley ithaca longisland newyork plattsburgh rochester syracuse utica watertown) ],
    NorthCarolina => [ qw(asheville boone charlotte eastnc fayetteville greensboro outerbanks raleigh wilmington winstonsalem) ],
    NorthDakota => [ qw(fargo nd) ],
    Ohio => [ qw(akroncanton athensohio cincinnati cleveland columbus dayton huntington limaohio mansfield parkersburg toledo wheeling youngstown) ],
    Oklahoma => [ qw(fortsmith lawton oklahomacity stillwater tulsa) ],
    Oregon => [ qw(bend corvallis eastoregon eugene medford oregoncoast portland salem) ],
    Pennsylvania => [ qw(altoona erie harrisburg lancaster allentown philadelphia pittsburgh poconos reading scranton pennstate york) ],
    RhodeIsland => [ qw(providence) ],
    SouthDakota => [ qw(sd) ],
    Texas => [ qw(dallas houston sanantonio austin beaumont brownsville) ],
    Utah => [ qw(logan ogden provo saltlakecity stgeorge) ],
    Vermont => [ qw(burlington) ],
    Virginia => [ qw(blacksburg charlottesville danville norfolk harrisonburg lynchburg richmond roanoke) ],
    Washington => [ qw(bellingham kpr pullman seattle spokane wenatchee yakima) ],
    WestVirginia => [ qw(charlestonwv huntington martinsburg morgantown parkersburg wheeling) ],
    Wisconsin => [ qw(appleton duluth eauclaire greenbay lacrosse madison milwaukee) ],
    Wyoming => [ qw(wyoming) ],
);

sub get_craigs {

    my $city = shift;

    # Download the page using get().
    # my $content = get( "http://$city.craigslist.org/search/tls?query=$keyword" ) or die $!;
    print "city == $city\n";
    print "keyword == $keyword\n";
    print "category == $cat\n";
    print "http://$city.craigslist.org/search/$cat?query=$keyword \n";

    my $content = get( "http://$city.craigslist.org/search/$cat?query=$keyword" ) or die $!;

    # Split up the page blob into lines so that we can manipulate them.
    my @lines = split(/\n/, $content);

    foreach my $i (0 .. $#lines)
    {
        # This is the key to the whole program, the returned listings are in rows
        # This is the item listing.
        # I tested this on bikes.
#                <p class="row">
#                        <span class="ih" id="images:3n63o53l45O25V35W4a8q669e2752037a111f.jpg">&nbsp;</span>
#                         Aug 26 - <a href="http://auburn.craigslist.org/bik/1920996795.html">Gary Fisher Mountain Bike  -</a>
#                         $950<font size="-1"> (Auburn, AL)</font> <span class="p"> pic</span><br class="c">
#                </p>
        if (($lines[$i] =~ /href/) && ($lines[$i] =~ /$city/))
        {
            print "line == $lines[$i]\n";
            my $line = $lines[$i];
            print HTML "$line<br>\n";
        }
    }


}

#------------------------------------------------------------------------------
# This didn't really have to be a subroutine, just cleaning things up and making
# them modular.  Open the file.
#------------------------------------------------------------------------------
sub open_html_file {
        open (HTML,">$html")
        or die "Error: cant't open $html \n $!";
}

#------------------------------------------------------------------------------
# Close the file.
#------------------------------------------------------------------------------
sub close_html_file {
        close HTML or die "Error: can't close $html\n $!";
}


#------------------------------------------------------------------------------
# Main.
#------------------------------------------------------------------------------

open_html_file();

# Make html the header
print HTML "<html>\n <head>\n <titel>CraigsList Crawler 3000</title>\n </head>\n <body>\n <br>\n\n" ;

# Iterate through the hash of arrays
foreach my $key ( keys %states )
{
    print HTML "<br>$key<br>\n";
    foreach my $i ( 0 .. $#{ $states{$key} } )
    {
        print HTML"<br>$states{$key}[$i]<br>\n";
        get_craigs($states{$key}[$i]);
        sleep(5);
    }
        print "\n";
}


print HTML " </body>\n\n" ;
close_html_file();

  3. July 13th, 2010 at 16:18 | #3

    I’m having a bit of trouble with this code. Could you possibly dumb it down a bit? What program do I use to run this with?

  4. jud
    July 14th, 2010 at 15:53 | #4

    @Tyler
    I’m sorry to hear you are having trouble. I just ran it looking for a system administrator job that had CCNP in it and it worked. Not beautifully, mind you, but it does work.

    Here is an example of how to run it:

    jud@litespeed:~/Stuff/Scripts/Web$ ./CLCrawler3000.pl sad "CCNP"

    Then just go to the directory from which the script was run and open the file clcrawler3000.html in your browser.

    I use Linux, specifically Ubuntu, and the script is written in Perl. I would be glad to dumb it down a bit if you could offer me suggestions, from my frame of reference I thought it was pretty simple.

  5. August 26th, 2010 at 04:21 | #5

    Hi,
    it is a great idea! May I suggest a few improvements which I think will make the program excellent?
    1. Of course, expand the categories. I know it is going to be quite tedious, but people have different needs.
    2. Allow the city to be specified from the command line. This is as simple as a line of code. Then, if you want the whole search, just leave it blank.
    3. The links in the html file do not work so well. I get links like:

    http://hiltonhead.craigslist.orghttp//charleston.craigslist.org/tls/1862144592.html

    Evidently, there is a small problem there. Also, most of the links are incomplete and cannot be accessed.

    4. A script is never user-friendly. I suggest you write a very simple interface for it. I know it may be beyond the scope (and the time you want to devote to it), but the idea is great and many people I know would use it.

    The script is really cool. I just started a blog with “Useful, practical things for linux” and I will mention it.

    Cheers,
    S.

  6. jud
    August 26th, 2010 at 21:31 | #6

    @Simone
    Thank you for your comments. I cleaned up the output as a result. I’m just going to parrot your numbers in my answers for easy reference.

    1. The script will search any category, I tell you how in the “Update.”
    2. If users want to only search one city, that is what CraigsList was designed to do.
    3. Fixed. Please download the new .tar file.
    4. I think CraigsList did a great job with their interface, simple and intuitive. My only problem was that it did not search multiple cities.

    Hope that helps.

    Jud

  7. April 15th, 2011 at 04:54 | #7

    Excellent read, I just passed this onto a colleague who was doing some research on that. And he actually bought me lunch because I found it for him smile Therefore let me rephrase that: Thank you for lunch!

  8. Jud
    April 15th, 2011 at 09:10 | #8

    @Sterling Kap
    Good to hear, thanks for the comment.

    Jud

