Description
A CrawlDB can grow to contain billions of records. At that scale, readdb -dump is impractical, and readdb -topN can run for a very long time (and does not produce a statistically correct sample).
We should add a -sample parameter to readdb -dump, followed by a number between 0 and 1; only that fraction of records from the CrawlDB will be processed.
The sample should be statistically random, and all the other filters should be applied to the sampled records.
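A minimal sketch of how such per-record Bernoulli sampling could work: each record is kept independently with probability equal to the -sample fraction, which yields a statistically random subset regardless of record order. The class and method names below are hypothetical illustrations, not the actual Nutch CrawlDbReader code.

```java
import java.util.Random;

// Hypothetical sketch: keep each CrawlDB record independently with
// probability `sample`, so roughly (sample * N) of N records survive.
public class CrawlDbSampler {
    private final double sample;  // fraction in [0, 1]
    private final Random random;

    public CrawlDbSampler(double sample, long seed) {
        if (sample < 0.0 || sample > 1.0) {
            throw new IllegalArgumentException("-sample must be between 0 and 1");
        }
        this.sample = sample;
        this.random = new Random(seed);
    }

    // Decide per record; the remaining filters (status, regex, etc.)
    // would then run only on records that pass this check.
    public boolean accept() {
        // nextDouble() is uniform in [0, 1), so this is true with
        // probability `sample` for every record, independently.
        return random.nextDouble() < sample;
    }

    public static void main(String[] args) {
        CrawlDbSampler sampler = new CrawlDbSampler(0.1, 42L);
        int kept = 0, total = 1_000_000;
        for (int i = 0; i < total; i++) {
            if (sampler.accept()) kept++;
        }
        System.out.println("kept " + kept + " of " + total);
    }
}
```

Because the decision is made independently per record, this fits naturally into a map-side filter and needs no global pass over the data.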