public marks

PUBLIC MARKS from dcancel with tag webcrawler

August 2006

Crawl-By-Example

Crawl-By-Example is a plugin for the Heritrix crawler that aims to improve the crawler's ability to find useful and interesting pages.

Ariel

Ariel is a library for extracting information from semi-structured documents (such as websites). It uses a small number of labeled examples to generate and learn effective extraction rules.

RDig - Ferret based full text search for web sites

RDig provides an HTTP crawler and content extraction utilities to help build site search for websites or intranets. Internally, it uses Ferret for full-text indexing.

February 2006

Heritrix

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

dcancel's TAGS related to tag webcrawler

opensource, ruby, rubyonrails, screenscraping, SearchEngine, software, thearchive