Details
Description
When issuing recrawls, some URLs may have expired (i.e. they no longer exist and return 404).
This issue adds a new command to the indexer that scans for WebPages with ProtocolStatusCodes.NOTFOUND and issues delete commands to Solr for them.
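The scan-and-delete idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the command's actual implementation: `NotFoundCleaner`, `Status`, `DeleteSink`, and `deleteNotFound` are hypothetical names, the map stands in for the stored WebPages, and the sink stands in for whatever sends delete-by-id requests to Solr.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NotFoundCleaner {
    // Hypothetical stand-in for the protocol status stored with each WebPage.
    enum Status { SUCCESS, NOTFOUND }

    // Hypothetical abstraction over the Solr client; a real job would
    // forward each id to Solr as a delete request.
    interface DeleteSink {
        void deleteById(String id);
    }

    // Scans all pages and issues one delete per page marked NOTFOUND.
    // Returns the number of delete commands issued.
    static int deleteNotFound(Map<String, Status> pages, DeleteSink sink) {
        int deleted = 0;
        for (Map.Entry<String, Status> entry : pages.entrySet()) {
            if (entry.getValue() == Status.NOTFOUND) {
                sink.deleteById(entry.getKey());
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        Map<String, Status> pages = new LinkedHashMap<>();
        pages.put("http://example.com/alive", Status.SUCCESS);
        pages.put("http://example.com/gone", Status.NOTFOUND);

        // Collect the ids instead of contacting a server.
        List<String> issued = new ArrayList<>();
        int count = deleteNotFound(pages, issued::add);
        System.out.println(count + " delete(s) issued: " + issued);
    }
}
```

Keeping the delete target behind a small interface makes the scan testable without a running Solr instance; in this sketch the page URL is assumed to be the document id.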
Attachments
Issue Links
- incorporates
  - NUTCH-987 Support HTTP auth for Solr communication (Closed)
  - NUTCH-1036 Solr jobs should increment counters in Reporter (Closed)
- is related to
  - NUTCH-963 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls) (Closed)
- relates to
  - NUTCH-1000 Add option not to commit to Solr (Closed)