Description
We need an option, db.ignore.internal.links, that operates in FetcherThread, just like db.ignore.external.links. An option with that name already exists, but it is used only by the LinkDB and defaults to true, which is not a suitable default for FetcherThread.
I propose drawing a clear distinction between options that are used by the LinkDB and those that are not. Most options used by the LinkDB already carry the right prefix, but db.ignore.*.links, db.max.inlinks and db.max.anchor.length do not yet.
This patch renames those options to use the linkdb.* prefix, so that afterwards we can implement a db.ignore.internal.links that operates in FetcherThread, just like db.ignore.external.links.
This will introduce a change in default parameters. Please comment.
How to upgrade from earlier releases
- Replace your old conf/nutch-default.xml with the conf/nutch-default.xml from the Nutch 1.12 release.
- If you use the LinkDB (e.g. invertlinks) and have modified any of the parameters db.max.inlinks, db.max.anchor.length or db.ignore.internal.links, rename them to linkdb.max.inlinks, linkdb.max.anchor.length and linkdb.ignore.internal.links respectively.
- db.ignore.internal.links and db.ignore.external.links now operate on the CrawlDB only
- linkdb.ignore.internal.links and linkdb.ignore.external.links now operate on the LinkDB only
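The renames above could look roughly like this in a site-level override file. This is only a sketch: it assumes the overrides live in conf/nutch-site.xml, and every value shown is an illustrative placeholder rather than a shipped default, so carry over whatever values you had set under the old db.* names.

```xml
<?xml version="1.0"?>
<!-- Sketch of overrides after upgrading; all values are placeholders. -->
<configuration>

  <!-- LinkDB (e.g. invertlinks): formerly db.max.inlinks -->
  <property>
    <name>linkdb.max.inlinks</name>
    <value>10000</value>
  </property>

  <!-- LinkDB: formerly db.max.anchor.length -->
  <property>
    <name>linkdb.max.anchor.length</name>
    <value>100</value>
  </property>

  <!-- LinkDB: formerly db.ignore.internal.links -->
  <property>
    <name>linkdb.ignore.internal.links</name>
    <value>true</value>
  </property>

  <!-- CrawlDB only: no longer read by the LinkDB -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>

  <!-- CrawlDB only: no longer read by the LinkDB -->
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>

</configuration>
```

After the upgrade, any db.max.inlinks or db.max.anchor.length entries left in your overrides are no longer picked up by the LinkDB under those names, so it is worth removing them to avoid confusion.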
Attachments
Issue Links
- is depended upon by: NUTCH-2221 Introduce db.ignore.internal.links to FetcherThread (Closed)
- is related to: NUTCH-2216 db.ignore.*.links to optionally follow internal redirects (Closed)