Details
- New Feature
- Status: Closed
- Minor
- Resolution: Fixed
- 1.13
- None
- None
- Patch Available
Description
Orphan scoring filter that determines whether a page has become orphaned, i.e. no other pages link to it any more. If a page has had no inlinks for markGoneAfter seconds, it is marked as gone and is then removed by an indexer. If a page has had no inlinks for markOrphanAfter seconds, it is removed from the CrawlDB.
Note: if you enable this plugin, you MUST make sure that you visit almost every URL at least once within the refetch interval. Otherwise, non-orphans may be marked as orphans and deleted! You can use NUTCH-2368 to make sure this is the case.
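The two thresholds described above can be illustrated with a minimal sketch. This is not the plugin's code: the function name `classify`, the `Status` enum, the idea of a stored "last inlink" timestamp, and the strict `>` comparison are all assumptions made for illustration. It also assumes markOrphanAfter is at least as large as markGoneAfter.

```python
from enum import Enum


class Status(Enum):
    KEEP = "keep"      # page still counts as linked
    GONE = "gone"      # marked gone; an indexer would then remove it
    ORPHAN = "orphan"  # would be removed from the CrawlDB


def classify(now, last_inlink_time, mark_gone_after, mark_orphan_after):
    """Classify a page by how long ago it was last linked to.

    All arguments are in seconds. The names are hypothetical and only
    mirror the markGoneAfter / markOrphanAfter settings in the description.
    """
    age = now - last_inlink_time
    # Check the larger threshold first so an orphan is not reported as gone.
    if age > mark_orphan_after:
        return Status.ORPHAN
    if age > mark_gone_after:
        return Status.GONE
    return Status.KEEP
```

The sketch also shows why the note matters: if a page is not revisited within the refetch interval, its "last inlink" time is never refreshed, so its age keeps growing and it can cross both thresholds even though other pages still link to it.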
Attachments
Issue Links
- requires
  - NUTCH-1921 Optionally disable HTTP if-modified-since header (Closed)
  - NUTCH-1913 LinkDB to implement db.ignore.external.links (Closed)
- links to