Description
When crawling http://www.mercenarytrader.com, which redirects to https://members.mercenarytrader.com, Nutch does not follow the redirect even though 'db.ignore.external.links' is set to 'true' and 'db.ignore.external.links.mode' is set to 'byDomain'. Both hosts belong to the same domain, so in 'byDomain' mode the redirect should be treated as internal.
The bug is in FetcherThread, where the redirect target is compared by host and not by domain:
String origHost = new URL(urlString).getHost().toLowerCase();
String newHost = new URL(newUrl).getHost().toLowerCase();
if (ignoreExternalLinks) {
  // Full host names are compared here, regardless of the configured mode.
  if (!origHost.equals(newHost)) {
    if (LOG.isDebugEnabled()) {
      // debug log statement omitted in this excerpt
    }
    return null;
  }
}
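To illustrate the difference between the two comparison modes, here is a minimal, self-contained sketch. It is not the actual FetcherThread code or the committed fix: the class `RedirectFilterSketch` and the methods `naiveDomain` and `isExternalRedirect` are hypothetical names, the domain is approximated by keeping the last two host labels (which mishandles public suffixes such as `co.uk`), and `java.net.URI` is used instead of `java.net.URL` to avoid checked exceptions. Inside Nutch, a domain helper such as `URLUtil.getDomainName` would presumably be the better choice.

```java
import java.net.URI;
import java.util.Locale;

/** Illustrative sketch only; not Nutch's FetcherThread. All names are hypothetical. */
class RedirectFilterSketch {

  /** Extracts the lower-cased host of a URL string. */
  static String host(String url) {
    String host = URI.create(url).getHost();
    if (host == null) {
      throw new IllegalArgumentException("URL has no host: " + url);
    }
    return host.toLowerCase(Locale.ROOT);
  }

  /**
   * Naive domain extraction (assumption): keep the last two labels of the host.
   * This is wrong for public suffixes like "co.uk" and for IP addresses.
   */
  static String naiveDomain(String url) {
    String host = host(url);
    String[] labels = host.split("\\.");
    if (labels.length <= 2) {
      return host;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  /** Returns true if a redirect from fromUrl to toUrl should be ignored as external. */
  static boolean isExternalRedirect(String fromUrl, String toUrl, boolean byDomain) {
    if (byDomain) {
      // Domain comparison: www.example.com and members.example.com are internal.
      return !naiveDomain(fromUrl).equals(naiveDomain(toUrl));
    }
    // Host comparison: any change of host name counts as external.
    return !host(fromUrl).equals(host(toUrl));
  }

  public static void main(String[] args) {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    System.out.println("byHost external:   " + isExternalRedirect(from, to, false));
    System.out.println("byDomain external: " + isExternalRedirect(from, to, true));
  }
}
```

With the URLs from this report, the host comparison classifies the redirect as external (so it is dropped), while the domain comparison classifies it as internal, which is the behaviour expected in 'byDomain' mode.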
Issue Links
- is related to
  - NUTCH-2069 Ignore external links based on domain (Closed)
  - NUTCH-2216 db.ignore.*.links to optionally follow internal redirects (Closed)