Description
In ParseOutputFormat, I see a code block:
    // collect outlinks for subsequent db update
    Outlink[] links = parseData.getOutlinks();
    int outlinksToStore = Math.min(maxOutlinks, links.length);
    if (ignoreExternalLinks) {
      try {
        fromHost = new URL(fromUrl).getHost().toLowerCase();
      } catch (MalformedURLException e) {
        fromHost = null;
      }
    } else {
      fromHost = null;
    }
The if (ignoreExternalLinks) check is then repeated in the ensuing for loop, where the host is looked up again for each link:
    int validCount = 0;
    CrawlDatum adjust = null;
    List<Entry<Text, CrawlDatum>> targets =
        new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
    List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
    for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
      String toUrl = links[i].getToUrl();
      // ignore links to self (or anchors within the page)
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        try {
          toHost = new URL(toUrl).getHost().toLowerCase();
        } catch (MalformedURLException e) {
          toHost = null;
        }
        if (toHost == null || !toHost.equals(fromHost)) { // external links
          continue; // skip it
        }
      }
Isn't that redundant? I don't think the first if block is needed.
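To make the data flow in the two quoted blocks easier to follow, here is a minimal standalone sketch of the same filtering logic. It is not the actual ParseOutputFormat code: the class OutlinkFilterSketch, the helper hostOf, and the method filterOutlinks are hypothetical names, plain strings stand in for Outlink objects, and the size of the kept list stands in for validCount (assuming validCount is incremented once per kept link, which the excerpt does not show).

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class OutlinkFilterSketch {

  /** Hypothetical helper: lower-cased host of a URL, or null if it is malformed. */
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  /** Hypothetical stand-in for the quoted loop, using strings instead of Outlink. */
  static List<String> filterOutlinks(String fromUrl, List<String> toUrls,
      boolean ignoreExternalLinks, int maxOutlinks) {
    // First quoted block: fromHost depends only on fromUrl, so it is
    // resolved once, before the loop.
    String fromHost = ignoreExternalLinks ? hostOf(fromUrl) : null;

    int outlinksToStore = Math.min(maxOutlinks, toUrls.size());
    List<String> kept = new ArrayList<>(outlinksToStore);

    // Second quoted block: toHost is resolved per link and compared
    // against the fromHost value computed above.
    for (int i = 0; i < toUrls.size() && kept.size() < outlinksToStore; i++) {
      String toUrl = toUrls.get(i);
      // ignore links to self
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        String toHost = hostOf(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) {
          continue; // external or unparseable link: skip it
        }
      }
      kept.add(toUrl);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> links = List.of(
        "http://example.com/a",
        "http://example.com/b",
        "http://other.org/c",
        "http://EXAMPLE.com/d");
    System.out.println(filterOutlinks("http://example.com/a", links, true, 10));
  }
}
```

In this sketch the first block only supplies fromHost, and the loop reads that value without reassigning it, so any cleanup of the first block would still need fromHost to be available before the loop starts.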
Issue Links
- is part of NUTCH-1184 Fetcher to parse and follow Nth degree outlinks (Closed)