[NUTCH-1525] Generator to record external links even when db.ignore.external.links set to true - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Auto Closed
Affects Version/s: None
Fix Version/s: 2.5
Component/s: generator
Labels: None

Description

When fetching pages from specific domains we have various options, e.g. use urlfilters, or set the above property (db.ignore.external.links) to true before injecting URLs into the webdb. However, with the former it is recognised that complex regexes can slow down processing, and with the latter we disregard a number of URLs which could potentially become useful in the future.
Unfortunately there is no way to record external links encountered for future processing, although the wiki suggests that a very small patch to the generator code can allow you to log these links to hadoop.log. Although this is better, a more robust storage mechanism would be preferred. This may tie in with custom counters we've already specified, or may require new counters to be implemented.
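
To make the idea concrete, here is a minimal sketch of recording external links instead of silently dropping them. It is not Nutch code and does not use any Nutch or Hadoop API: the class name ExternalLinkRecorder and its methods are hypothetical, and it assumes that "external" simply means the outlink's host differs from the host of the page it was found on. A per-host tally stands in for whatever counter or storage mechanism would eventually be chosen.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Nutch: tallies outlinks that leave the source host. */
public class ExternalLinkRecorder {

    // External host -> number of links seen pointing at it.
    private final Map<String, Integer> externalHostCounts = new TreeMap<>();

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /** Assumption: a link is external when both hosts are known and differ. */
    static boolean isExternal(String fromUrl, String toUrl) {
        String from = hostOf(fromUrl);
        String to = hostOf(toUrl);
        return from != null && to != null && !from.equals(to);
    }

    /** Records the link if it is external; returns whether it was recorded. */
    public boolean record(String fromUrl, String toUrl) {
        if (!isExternal(fromUrl, toUrl)) {
            return false;
        }
        externalHostCounts.merge(hostOf(toUrl), 1, Integer::sum);
        return true;
    }

    /** Exposes the tally of external hosts seen so far. */
    public Map<String, Integer> counts() {
        return externalHostCounts;
    }

    public static void main(String[] args) {
        ExternalLinkRecorder recorder = new ExternalLinkRecorder();
        recorder.record("http://example.org/a", "http://example.org/b");       // same host, ignored
        recorder.record("http://example.org/a", "http://other.example.com/x"); // external
        recorder.record("http://example.org/c", "http://other.example.com/y"); // external
        System.out.println(recorder.counts()); // prints {other.example.com=2}
    }
}
```

In a real job the in-memory map would presumably be replaced by job counters or a persistent store, which is the open question this issue raises.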

Attachments

nutch-logExternal.patch
14/Feb/14 09:25
0.8 kB
Dmitry Cherniachenko

People

Assignee: Unassigned
Reporter: Lewis John McGibbney
Votes: 0
Watchers: 3

Dates

Created: 27/Jan/13 18:29
Updated: 13/Oct/19 22:35
Resolved: 13/Oct/19 22:35