Description
See discussion in NUTCH-2335:
- add counters for removed items from CrawlDb:
Injector: Total urls removed from CrawlDb by filters: 2
Injector: Total urls with status gone removed from CrawlDb (db.update.purge.404): 0
- add -Ddb.update.purge.404=true to command-line help:
Usage: Injector [-D...] <crawldb> <url_dir> [-overwrite|-update] [-noFilter] [-noNormalize] [-filterNormalizeAll]
...
  -D...  set or overwrite configuration property (property=value)
  -Ddb.update.purge.404=true
         remove URLs with status gone (404) from CrawlDb
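As a rough illustration of the usage text above, an invocation with purging enabled might look like the sketch below. It assumes the standard bin/nutch launcher script with its inject command; the paths crawl/crawldb and urls are hypothetical placeholders, not taken from this issue.

```shell
# Hypothetical locations; replace with your own CrawlDb and seed URL directory.
CRAWLDB="crawl/crawldb"
URL_DIR="urls"

# Assemble the Injector invocation with purging of "gone" (404) entries enabled.
# Per the usage text, -D properties come before the positional arguments.
INJECT_CMD="bin/nutch inject -Ddb.update.purge.404=true $CRAWLDB $URL_DIR"

# Print the command for review. To run it, execute it from the Nutch
# runtime directory (assumed to contain bin/nutch):
#   $INJECT_CMD
echo "$INJECT_CMD"
```

If the change works as described in this issue, the job output should then include the two "Injector: Total urls ... removed from CrawlDb" counter lines.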
Issue Links
- is related to NUTCH-2335 Injector not to filter and normalize existing URLs in CrawlDb (Closed)