Nutch / NUTCH-2463

Enable sampling CrawlDB


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 1.14
    • Component/s: crawldb
    • Labels: None

    Description

      The CrawlDb can grow to contain billions of records. When that happens, readdb -dump is pretty useless, and readdb -topN can run for ages (and does not provide a statistically correct sample).
      We should add a -sample parameter to readdb -dump, followed by a number between 0 and 1, so that only that fraction of the CrawlDb records is processed.
      The sample should be statistically random, and all the other filters should be applied to the sampled records. A hedged sketch of what this could look like follows below.
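
      An illustrative CLI invocation under that proposal (the output path and the exact option syntax here are assumptions, not the committed behaviour) could look like:

          bin/nutch readdb crawl/crawldb -dump crawldb_sample -sample 0.01

      A minimal sketch of how the sampling step could be wired into the dump job, assuming a Hadoop mapper over CrawlDb entries; the class name SamplingDumpMapper and the configuration key db.reader.dump.sample are hypothetical and not the actual Nutch implementation:

          import java.io.IOException;
          import java.util.Random;

          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Mapper;
          import org.apache.nutch.crawl.CrawlDatum;

          /**
           * Illustrative mapper: passes each CrawlDb record through with
           * probability "sample" (0 < sample <= 1), yielding a statistically
           * random subset before the usual dump filters are applied.
           */
          public class SamplingDumpMapper
              extends Mapper<Text, CrawlDatum, Text, CrawlDatum> {

            private float sample;
            private Random random;

            @Override
            protected void setup(Context context) {
              // Hypothetical configuration key carrying the value of -sample.
              sample = context.getConfiguration()
                  .getFloat("db.reader.dump.sample", 1.0f);
              random = new Random();
            }

            @Override
            protected void map(Text url, CrawlDatum datum, Context context)
                throws IOException, InterruptedException {
              // Keep roughly "sample" fraction of the records, drawn
              // independently per record, so the result is a uniform
              // random sample of the CrawlDb.
              if (random.nextFloat() <= sample) {
                context.write(url, datum);
              }
            }
          }

      Sampling each record independently keeps the map-side logic stateless and scales to billions of records; the trade-off is that the resulting sample size is only approximately sample * total, not exact.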

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Yossi Tamari
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
