[NUTCH-784] CrawlDBScanner - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1
Component/s: None
Labels: None
Patch Info: Patch Available

Description

The patch file contains a utility which dumps all the crawldb entries whose URL matches a regular expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs, as the output can be extremely large, and the -url function can't help if we don't know which URL we want to look at.

The CrawlDBScanner can either generate a text representation of the CrawlDatum-s or binary objects which can then be used as a new CrawlDB.

Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]

regex: regular expression on the crawldb key
-s status : constraint on the status of the crawldb entries e.g. db_fetched, db_unfetched
-text : if this parameter is used, the output will be of TextOutputFormat; otherwise it generates a 'normal' crawldb with the MapFileOutputFormat

For instance, the command below:
./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump '.+amazon.com.*' -s db_fetched -text

will generate a text file /tmp/amazon-dump containing all the entries of the crawldb that match the regexp .+amazon.com.* and have a status of db_fetched. The regexp is quoted here so that the shell passes it through unchanged.
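
The selection rule described above (regular expression on the key, plus an optional status constraint) can be sketched in plain Java. This is only an illustration and not the code in NUTCH-784.patch: the class name CrawlDbFilterSketch, the filter method, and the use of a simple URL-to-status map in place of CrawlDatum objects are all hypothetical, and it assumes the regexp must match the whole URL, which the leading .+ and trailing .* in the example suggest but the description does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the scanner's filtering step; not taken from the patch.
public class CrawlDbFilterSketch {

    // Keeps the entries whose URL fully matches the regex (assumption: full match,
    // not substring search) and, when status is non-null, whose status equals it.
    static Map<String, String> filter(Map<String, String> entries, String regex, String status) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            if (!pattern.matcher(entry.getKey()).matches()) {
                continue; // URL does not match the regular expression
            }
            if (status != null && !status.equals(entry.getValue())) {
                continue; // status constraint given (-s) and not satisfied
            }
            kept.put(entry.getKey(), entry.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy crawldb: URL -> status name, standing in for CrawlDatum entries.
        Map<String, String> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://www.amazon.com/books", "db_fetched");
        crawlDb.put("http://www.amazon.com/music", "db_unfetched");
        crawlDb.put("http://www.example.org/", "db_fetched");

        // Mirrors: <regex> = .+amazon.com.*  and  -s db_fetched
        filter(crawlDb, ".+amazon.com.*", "db_fetched")
            .forEach((url, st) -> System.out.println(url + "\t" + st));
    }
}
```

Note that the dots in .+amazon.com.* are unescaped, so each matches any character; a stricter pattern would be .+amazon\.com.*, at the cost of readability on the command line.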

Attachments

NUTCH-784.patch
01/Feb/10 14:33
6 kB
Julien Nioche

Issue Links

is related to

NUTCH-806 Merge CrawlDBScanner with CrawlDBReader

Closed

People

Assignee: Julien Nioche

Reporter: Julien Nioche

Votes: 0

Watchers: 0

Dates

Created: 01/Feb/10 14:32

Updated: 30/Mar/10 04:15

Resolved: 29/Mar/10 12:12