HarvestMan is a multi-threaded web-crawler written in Python. The primary method of running it is to edit the config file 'config.xml' and then call python on the harvestman.py file.

Where to find things

The HarvestMan code is installed both on bulba and on the daughter machines, in /opt/HarvestMan-1.4.6/. Your first config.xml file can be created by copying from below and pasting into a text file.

There is some documentation online at http://harvestmanontheweb.com/, particularly in the HarvestMan FAQ.

Why you might use it

HarvestMan is particularly good for web-crawls of a single domain. It can pretty quickly grab most of the public files from a domain and its subdomains (subject to some configuration options). There are also options for restricting the files it grabs to just those with paths matching a certain pattern or with text containing key words. However, it is not sufficiently sophisticated to try indexing the whole internet, for example, and even on particularly large domains it may take a few hours.

Similar alternatives

For some web-scraping jobs, it might make more sense to write your own small crawler, so you could tell it more specifically where to get its pages from.
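As a rough idea of what such a small crawler could look like, here is a minimal sketch using only the Python standard library. It is not related to HarvestMan's code; the names LinkParser, extract_links and crawl are made up for illustration, and it assumes a current Python 3 rather than the Python 2.3 used elsewhere on this page.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for the links found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def crawl(start_url, max_pages=20):
    """Breadth-first crawl that stays on the host of start_url."""
    host = urlparse(start_url).netloc
    queue, seen, pages = [start_url], {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        pages[url] = html
        for link in extract_links(url, html):
            # Only follow links on the starting host that we have not queued yet.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling crawl("http://www.foo.com") would return a dictionary mapping each fetched URL to its HTML. The sketch does not read robots.txt or throttle its requests, so a real crawler would need both.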

With the Google API, you can make use of their indexing to do many of the tasks you might use a web-crawl for. You can retrieve a list of pages relevant to a query (restricted to a domain if you like), retrieve cached pages, and retrieve lists of inbound links.

The config file

<?xml version="1.0" encoding="utf-8"?>
<HarvestMan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://harvestman.freezope.org/schemas/HarvestMan.xsd">
  <config version="3.0" xmlversion="1.0">
    <projects combine="0">

      <project skip="0">
        <url>http://www.foo.com</url>
        <name>fooproject</name>
        <basedir>~/websites</basedir>
        <verbosity value="3"/>
      </project>

    </projects>

    <network>
      <proxy>
        <proxyserver></proxyserver>
        <proxyuser></proxyuser>
        <proxypasswd></proxypasswd>
        <proxyport value="80"/>
      </proxy>
      <urlserver status="0">
        <urlhost>localhost</urlhost>
        <urlport value="3081"/>
      </urlserver>
    </network>

    <download>
      <types>
        <html value="1"/>
        <images value="1"/>
        <javascript value="1"/>
        <javaapplet value="1"/>
        <forms value="0"/>
        <cookies value="1"/>
      </types>
      <cache status="1">
        <datacache value="1"/>
      </cache>
      <misc>
        <retries value="1"/>
      </misc>
    </download>

    <control>
      <links>
        <imagelinks value="1"/>
        <stylesheetlinks value="1"/>
      </links>
      <extent>
        <fetchlevel value="0"/>
        <extserverlinks value="0"/>
        <extpagelinks value="1"/>
        <depth value="10"/>
        <extdepth value="0"/>
        <subdomain value="1"/>
      </extent>
      <limits>
        <maxextservers value="0"/>
        <maxextdirs value="0"/>
        <maxfiles value="5000"/>
        <maxfilesize value="5048576"/>
        <connections value="10"/>
        <requests value="10"/>
        <timelimit value="-1"/>
      </limits>
      <rules>
        <robots value="1"/>
        <urlpriority></urlpriority>
        <serverpriority></serverpriority>
      </rules>
      <filters>
        <urlfilter></urlfilter>
        <serverfilter></serverfilter>
        <wordfilter></wordfilter>
        <junkfilter value="0"/>
      </filters>
    </control>

    <system>
      <workers status="1" size="10" timeout="200"/>
      <trackers value="10"/>
      <locale>C</locale>
      <fastmode value="1"/>
    </system>

    <files>
      <urllistfile></urllistfile>
      <urltreefile></urltreefile>
      <archive status="0" format="bzip"/>
      <urlheaders status="1" />
    </files>

    <indexer>
      <localise value="2"/>
    </indexer>

    <display>
      <browsepage value="1"/>
    </display>

  </config>

</HarvestMan>

There is a detailed description of the configuration options at HarvestMan Configuration File.
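If you would rather not edit the file by hand for every project, the per-project values can be set from a script. The sketch below uses Python's standard xml.etree.ElementTree; the element paths are taken from the sample config above, while the function names set_project and rewrite_config are only illustrative.

```python
import xml.etree.ElementTree as ET


def set_project(root, url, name, fetchlevel):
    """Set the url, name and fetchlevel in a parsed <HarvestMan> root element."""
    project = root.find("config/projects/project")
    project.find("url").text = url
    project.find("name").text = name
    # fetchlevel is stored as an attribute: <fetchlevel value="0"/>
    root.find("config/control/extent/fetchlevel").set("value", str(fetchlevel))
    return root


def rewrite_config(path, url, name, fetchlevel):
    """Load the config at path, update it, and write it back in place."""
    tree = ET.parse(path)
    set_project(tree.getroot(), url, name, fetchlevel)
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

For example, rewrite_config("config.xml", "http://www.foo.com/bar/", "barproject", 0) would point the project at a new starting url. Note that ElementTree discards XML comments when it rewrites a file, so keep a copy if your config is annotated.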

Fetch levels

According to the FAQ,

A fetchlevel of "0" provides the maximum constraint for the user. This limits the download of files to all paths in the starting server, only inside and below the directory of the starting url.

For example, with a fetchlevel of zero, if your starting url is http://www.foo.com/bar/images/images.html, the program will download only those files inside the <images> sub-directory and directories below it, and no other file.

The next level, a fetch level of "1", again limits the download to the starting server (and sub-domains in it, if the sub-domain variable is not set), but does not allow it to crawl sites other than the starting server. In the above example, this will fetch all links in the server http://www.foo.com, encountered in the starting page.

A fetch level of "2" fetches all links in the starting server encountered in the starting url, as well as any links in outside (external) servers linked directly from pages in the starting server. It does not allow the program to crawl pages linked further away, i.e. the second-level links linked from the external servers.

A fetch level of "3" performs a similar operation, with the main difference that it acts like a combination of fetchlevels "0" and "2" minus "1". That is, it gets all links under the directory of the starting url and first level external links, but does not fetch links outside the directory of the starting url.

A fetch level of "4" gives the user no control over the levels of fetching, and the program will crawl whichever link is available to it unless limited by other download control options like depth control, domain filters, url filters, file limits, maximum server limit etc.

In short we can summarize the above rules in the following download guidelines.

If you just want to download all links directly below the starting url, use a fetch level of zero.

If you want to download all links linked to the starting url in the same server, use a fetch level of one.

If you want to download all links directly below the starting url, and also first level links linked to other websites, use a fetch level of three.

If you want to download all links linked to the starting url in the same server, and also first level links linked to other websites, use a fetch level of two.

If you don't want to prescribe any limits, set a fetch level of four and tweak other download control options like depth fetching, file limits etc.
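The rules above can be modeled as a small predicate. This is only one reading of the FAQ text, not HarvestMan's own code: the function name allowed and its parameters are made up for illustration, hosts are compared exactly, and the subdomain setting and the other limits (depth, filters, file counts) are ignored.

```python
from urllib.parse import urlparse


def allowed(start_url, link_url, found_on_url, fetchlevel):
    """Say whether link_url, seen on page found_on_url, would be fetched."""
    start = urlparse(start_url)
    link = urlparse(link_url)
    parent = urlparse(found_on_url)

    # Directory of the starting url, e.g. "/bar/images/" for ".../images.html".
    start_dir = start.path[: start.path.rfind("/") + 1] or "/"

    same_server = link.netloc == start.netloc
    in_start_dir = same_server and (link.path or "/").startswith(start_dir)
    # An external link counts as "first level" if it was found on the starting server.
    first_level_external = not same_server and parent.netloc == start.netloc

    if fetchlevel == 0:
        return in_start_dir
    if fetchlevel == 1:
        return same_server
    if fetchlevel == 2:
        return same_server or first_level_external
    if fetchlevel == 3:
        return in_start_dir or first_level_external
    if fetchlevel == 4:
        return True
    raise ValueError("fetchlevel must be between 0 and 4")
```

With the starting url http://www.foo.com/bar/images/images.html, a link to http://www.foo.com/index.html is rejected at levels 0 and 3 but accepted at levels 1 and 2.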

Now that that's all cleared up,

Running HarvestMan

So now you have a directory where HarvestMan can put its log files, and where you have placed a config.xml file with all the variables set how you like them. From that directory, run

python /usr/lib/python2.3/site-packages/HarvestMan/harvestman.py

To simplify your life, you might make a link to harvestman.py, and then call python on the link.

ln -s /usr/lib/python2.3/site-packages/HarvestMan/harvestman.py harvestman
python harvestman

Another alternative is to create an alias for this command in your .bashrc.

It is also possible to run the program by supplying arguments on the command line. For example,

python harvestman.py http://www.foo.com -b ~/mywebsites -V 3

This tells the program to start crawling http://www.foo.com with verbosity 3 and save the files under the ~/mywebsites folder. Since no project name was specified, the program will use the domain name (www.foo.com) and create a project directory with that name.
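If you have several sites to crawl, the same command line can be driven from a short Python script. The sketch below only uses the -b and -V options shown above; the helper names build_command and crawl_all are made up, and the default script path harvestman.py is an assumption you would replace with the real location (or your link).

```python
import os
import subprocess


def build_command(url, basedir, verbosity=3, script="harvestman.py"):
    """Build the argument list for one HarvestMan run."""
    # The shell normally expands "~"; subprocess does not, so do it here.
    basedir = os.path.expanduser(basedir)
    return ["python", script, url, "-b", basedir, "-V", str(verbosity)]


def crawl_all(urls, basedir):
    """Run HarvestMan once per starting url, stopping at the first failure."""
    for url in urls:
        subprocess.run(build_command(url, basedir), check=True)
```

For example, crawl_all(["http://www.foo.com", "http://www.bar.com"], "~/mywebsites") would run the two crawls one after the other, each in its own project directory.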

It is also possible to run the program with a different configuration file by passing its path with the -C option.

python harvestman -C projdir/project.xml

May your harvest be plentiful.

None: HarvestMan (last edited 2008-04-20 05:09:38 by LucienCarroll)