  1. Nutch
  2. NUTCH-1525

Generator to record external links even when db.ignore.external.links set to true

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Auto Closed
    • Affects Version/s: None
    • Fix Version/s: 2.5
    • Component/s: generator
    • Labels: None

    Description

      When fetching pages from specific domains we have several options, e.g. use urlfilters, or set db.ignore.external.links to true before injecting urls into the webdb. However, with the former it is recognised that complex regexes can slow down processing, and with the latter we disregard a number of urls which could become useful in the future.
      Unfortunately there is currently no way to record the external links encountered for future processing, although the wiki suggests that a very small patch to the generator code allows you to log these links to hadoop.log. Although this is better than nothing, a more robust storage mechanism would be preferred. This may tie in with the custom counters we have already specified, or may require new counters to be implemented.
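
      The idea of recording external links instead of silently dropping them can be illustrated with a minimal, standalone sketch. It does not use Nutch or Hadoop APIs and is not the attached patch: the class ExternalLinkRecorder and its methods are hypothetical, "external" is assumed to mean "different host than the source page", and a plain in-memory map stands in for whatever counters or storage a real implementation would use.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: keeps internal outlinks and
// records the external ones that would otherwise be discarded.
public class ExternalLinkRecorder {
    // Stand-in for per-host counters (assumption: a real job would use job counters or a store).
    private final Map<String, Integer> externalHostCounts = new TreeMap<>();
    // External links retained for possible future processing.
    private final List<String> externalLinks = new ArrayList<>();

    // Returns the lower-cased host of a URL, or null if it cannot be parsed or has no host.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Returns the outlinks on the same host as fromUrl; records all other parsable links.
    public List<String> filter(String fromUrl, List<String> outlinks) {
        String fromHost = hostOf(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            String host = hostOf(link);
            if (host == null) {
                continue; // unparsable or host-less link: neither kept nor recorded
            }
            if (host.equals(fromHost)) {
                kept.add(link);
            } else {
                externalLinks.add(link);
                externalHostCounts.merge(host, 1, Integer::sum);
            }
        }
        return kept;
    }

    public List<String> getExternalLinks() {
        return externalLinks;
    }

    public Map<String, Integer> getExternalHostCounts() {
        return externalHostCounts;
    }

    public static void main(String[] args) {
        ExternalLinkRecorder recorder = new ExternalLinkRecorder();
        List<String> kept = recorder.filter(
                "http://nutch.apache.org/",
                List.of("http://nutch.apache.org/about.html", "http://example.com/a"));
        System.out.println("kept: " + kept);
        System.out.println("external: " + recorder.getExternalLinks());
        System.out.println("counts: " + recorder.getExternalHostCounts());
    }
}
```

      Compared with logging to hadoop.log, keeping the links in a structure like this means they can be counted per host and fed back into a later crawl rather than being grepped out of log files.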

      Attachments

        1. nutch-logExternal.patch
          0.8 kB
          Dmitry Cherniachenko

        Activity

          People

            Assignee: Unassigned
            Reporter: Lewis John McGibbney (lewismc)
            Votes: 0
            Watchers: 3
