[NUTCH-649] Log list of files found but not crawled. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Auto Closed
Affects Version/s: None
Fix Version/s: 2.5
Component/s: fetcher
Labels: None
Environment: any

Description

I use Nutch to find the location of executables on the web, but we do not download the executables with Nutch. To get Nutch to report the location of files without downloading them, I had to make a very small patch to the code, and I think this change might be useful to others as well. The patch logs the URLs that are being filtered at the info level, although perhaps it should be at the debug level.

I have included an svn diff with this change. It can serve both as a diagnostic tool (to see what is being skipped) and as a way to find the content and links that a page or site points to without actually downloading that content.

Index: ParseOutputFormat.java
===================================================================
--- ParseOutputFormat.java (revision 593619)
+++ ParseOutputFormat.java (working copy)
@@ -193,17 +193,20 @@
               toHost = null;
             }
             if (toHost == null || !toHost.equals(fromHost)) { // external links
+              LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
+
               continue; // skip it
             }
           }
           try {
             toUrl = normalizers.normalize(toUrl,
                         URLNormalizers.SCOPE_OUTLINK); // normalize the url
-            toUrl = filters.filter(toUrl); // filter the url
-            if (toUrl == null) {
-              continue;
-            }
-          } catch (Exception e) {
+
+            if (filters.filter(toUrl) == null) { // filter the url
+              LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
+              continue;
+            }
+          } catch (Exception e) {
             continue;
           }
           CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
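
For readers who want to try the idea outside of Nutch, the following is a minimal stand-alone sketch of the same technique: walk a page's outlinks, skip the ones that are external or rejected by a filter, and log each skipped URL together with the page that linked to it. It is not the patch itself. The class name OutlinkFilterSketch, the helper methods, the example URLs, and the ".exe" filter are all hypothetical, and java.util.logging is used only to keep the example self-contained. The log level is a parameter, so the info-versus-debug question raised in the description becomes a caller's choice.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-alone sketch; not part of Nutch or of the attached patches.
public class OutlinkFilterSketch {
    private static final Logger LOG =
        Logger.getLogger(OutlinkFilterSketch.class.getName());

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Keeps the outlinks that pass the checks and logs every skipped one.
    // The filter returns the (possibly rewritten) URL, or null to reject it.
    static List<String> keepAndLogSkipped(String fromUrl, List<String> outlinks,
            boolean ignoreExternalLinks, UnaryOperator<String> filter, Level level) {
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            if (ignoreExternalLinks) {
                String toHost = hostOf(toUrl);
                if (toHost == null || !toHost.equals(fromHost)) {
                    LOG.log(level, "filtering externalLink {0} linked to by {1}",
                            new Object[] {toUrl, fromUrl});
                    continue; // skip external link
                }
            }
            String filtered = filter.apply(toUrl);
            if (filtered == null) {
                LOG.log(level, "filtering content {0} linked to by {1}",
                        new Object[] {toUrl, fromUrl});
                continue; // skip rejected URL
            }
            kept.add(filtered); // keep the filter's result, not the raw URL
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed stand-in filter: reject anything that looks like an executable.
        UnaryOperator<String> noExecutables =
            url -> url.endsWith(".exe") ? null : url;
        List<String> outlinks = Arrays.asList(
            "http://example.com/docs/index.html",
            "http://example.com/files/setup.exe",
            "http://other.example.org/about.html");
        List<String> kept = keepAndLogSkipped("http://example.com/page.html",
            outlinks, true, noExecutables, Level.INFO);
        System.out.println(kept);
    }
}
```

Passing Level.FINE instead of Level.INFO gives the quieter, debug-style behaviour. With the default java.util.logging configuration those messages are not shown until the logger and its handler are set to a finer level.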

Attachments

NUTCH-649.2.x.patch    29/Apr/13 04:16    0.7 kB    Tejas Patil
NUTCH-649.trunk.patch  29/Apr/13 04:16    0.9 kB    Tejas Patil

People

Assignee: Unassigned
Reporter: Jim
Votes: 0
Watchers: 2

Dates

Created: 28/Aug/08 00:33
Updated: 13/Oct/19 22:36
Resolved: 13/Oct/19 22:36