Nutch / NUTCH-2144

Plugin to override db.ignore.external to exempt interesting external domain URLs


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12
    • Component/s: crawldb, fetcher
    • Labels: None
    • Patch Info: Patch Available
    • Flags: Patch

    Description

      Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true) to fetch static resources from external domains.
      More generally, the plugin should permit interesting URLs from external domains by overriding db.ignore.external.links for those URLs. Whether a URL is interesting is decided by a combination of regex and MIME-type rules.

      Concrete use case:
      When using Nutch to crawl images from a set of domains, the crawler needs to fetch all images, including those linked from CDNs and other external domains. In this scenario, allowing all external links and then writing hundreds of regular expressions to filter them is not feasible for a large number of domains.
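
      The rule combination described above can be sketched roughly as follows. This is only an illustration of the idea, not the code from the attached patch: the class ExemptionSketch, the nested Rule type, and the shouldFollow method are hypothetical names, hosts are compared by exact match, and the MIME type is assumed to be supplied by the caller (how it is guessed is left out).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch: decides whether a link may be followed while external links are ignored.
class ExemptionSketch {

    // One exemption rule: an external link is exempt only if BOTH patterns match.
    static final class Rule {
        final Pattern urlPattern;
        final Pattern mimePattern;

        Rule(String urlRegex, String mimeRegex) {
            this.urlPattern = Pattern.compile(urlRegex);
            this.mimePattern = Pattern.compile(mimeRegex);
        }

        boolean matches(String url, String mimeType) {
            return urlPattern.matcher(url).matches()
                && mimePattern.matcher(mimeType).matches();
        }
    }

    private final List<Rule> rules;

    ExemptionSketch(List<Rule> rules) {
        this.rules = rules;
    }

    // Returns the lower-cased host, or "" if the URL cannot be parsed or has no host.
    static String host(String url) {
        try {
            String h = new URI(url).getHost();
            return h == null ? "" : h.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }

    // Internal links are always followed; external links only when some rule exempts them.
    // mimeType is an assumption of this sketch: the caller provides it.
    boolean shouldFollow(String fromUrl, String toUrl, String mimeType) {
        String from = host(fromUrl);
        String to = host(toUrl);
        if (to.isEmpty()) {
            return false; // unparseable target: do not follow
        }
        if (from.equals(to)) {
            return true; // same host: not an external link
        }
        for (Rule rule : rules) {
            if (rule.matches(toUrl, mimeType)) {
                return true; // external, but exempted by a rule
            }
        }
        return false; // external and not exempted
    }

    public static void main(String[] args) {
        // Example rule (made up for illustration): image files served from any host.
        ExemptionSketch filter = new ExemptionSketch(Arrays.asList(
            new Rule("^https?://.*\\.(?:jpe?g|png|gif)$", "image/.*")));
        String page = "http://example.com/page.html";
        System.out.println(filter.shouldFollow(page, "http://cdn.example.net/img/logo.png", "image/png"));
        System.out.println(filter.shouldFollow(page, "http://other.org/index.html", "text/html"));
    }
}
```

      Requiring both the URL pattern and the MIME-type pattern to match keeps a single image rule from also admitting external HTML pages whose URLs merely end in an image extension.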

      Attachments

        1. ignore-exempt.patch
          34 kB
          Thamme Gowda
        2. ignore-exempt.patch
          92 kB
          Thamme Gowda


          People

            chrismattmann Chris A. Mattmann
            thammegowda Thamme Gowda
            Votes: 0
            Watchers: 7

