Nutch / NUTCH-1525

Generator to record external links even when db.ignore.external.links set to true

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Auto Closed
    • Affects Version/s: None
    • Fix Version/s: 2.5
    • Component/s: generator
    • Labels: None

      Description

      When fetching pages from specific domains we have various options, e.g. use urlfilters, or set db.ignore.external.links to true before injecting URLs into the webdb. However, with the former it is recognised that complex regexes can slow down processing, and with the latter we disregard a number of URLs which could potentially become useful in the future.
      Unfortunately there is no way to record the external links encountered for future processing, although the wiki suggests that a very small patch to the generator code can allow you to log these links to hadoop.log. Although this is better, a more robust storage mechanism would be preferred. This may tie in with custom counters we've already specified, or may require new counters to be implemented.
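
      As a rough illustration of the behaviour requested above, the following standalone sketch drops external outlinks but still records them and keeps a per-host tally. It is not Nutch code and is not the attached patch: the class ExternalLinkRecorder and its methods are hypothetical names, "external" is assumed to mean that the outlink's host differs from the host of the page it was found on, and the in-memory map only stands in for whatever counters or storage a real implementation would use.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: keeps internal outlinks and
// records the external ones instead of silently discarding them.
final class ExternalLinkRecorder {

    // Stand-in for job counters: external host -> number of links seen.
    private final Map<String, Integer> externalHostCounts = new TreeMap<>();

    // Stand-in for a more robust store of the skipped links themselves.
    private final List<String> externalLinks = new ArrayList<>();

    // Returns the lower-cased host of a URL, or null if it has none
    // or cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Returns only the outlinks on the same host as fromUrl; every other
    // parsable outlink is recorded and counted as external.
    List<String> filter(String fromUrl, List<String> outlinks) {
        String fromHost = hostOf(fromUrl);
        List<String> internal = new ArrayList<>();
        for (String link : outlinks) {
            String toHost = hostOf(link);
            if (toHost == null) {
                continue; // no usable host: neither kept nor recorded
            }
            if (toHost.equals(fromHost)) {
                internal.add(link);
            } else {
                externalLinks.add(link);
                externalHostCounts.merge(toHost, 1, Integer::sum);
            }
        }
        return internal;
    }

    Map<String, Integer> getExternalHostCounts() {
        return externalHostCounts;
    }

    List<String> getExternalLinks() {
        return externalLinks;
    }

    public static void main(String[] args) {
        ExternalLinkRecorder recorder = new ExternalLinkRecorder();
        List<String> outlinks = new ArrayList<>();
        outlinks.add("http://example.org/about.html");
        outlinks.add("http://other.example.com/page");
        outlinks.add("http://other.example.com/more");
        outlinks.add("https://apache.org/");

        List<String> kept = recorder.filter("http://example.org/index.html", outlinks);
        System.out.println("kept: " + kept);
        System.out.println("external links: " + recorder.getExternalLinks());
        System.out.println("external host counts: " + recorder.getExternalHostCounts());
    }
}
```

      In a real job the tally would more likely be reported through the framework's counter mechanism and the links written to a dedicated output rather than held in memory; the sketch only shows the separation of "ignore for crawling" from "forget entirely".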

        Attachments

        1. nutch-logExternal.patch
          0.8 kB
          Dmitry Cherniachenko

            People

            • Assignee: Unassigned
            • Reporter: Lewis John McGibbney (lewismc)
            • Votes: 0
            • Watchers: 3
