Nutch / NUTCH-2627

Fetcher to optionally filter URLs

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: 1.16
    • Fix Version/s: 1.16
    • Component/s: fetcher
    • Labels: None
    • Patch

    Description

      When running a large web crawl, a webadmin may request that crawling of a certain domain stop immediately. The default Nutch workflow applies URL filters only to seeds and outlinks. Applying filters during fetch list generation is expensive with a large CrawlDb (fetch lists are usually much shorter). Allowing the fetcher to optionally filter URLs would make it possible to apply changed filter rules to the next launched fetcher job even if the segment has already been created (especially if multiple segments are generated in one turn).
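
      The idea described above can be illustrated with a minimal, self-contained Java sketch. It is not the patch attached to this issue and does not use Nutch's Fetcher or URLFilter classes: the class name, the constructor flag, and the deny rule are all hypothetical. The sketch only assumes that filter rules are loaded when a fetcher job starts, and that a rejected URL is signalled by returning null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of fetch-time URL filtering; not Nutch's actual API. */
public class FetchTimeUrlFilterSketch {

    /** Deny rules, assumed to be (re)loaded each time a fetcher job starts. */
    private final List<Pattern> denyRules;

    /** Filtering is optional: when false, every URL passes unchanged. */
    private final boolean filterEnabled;

    public FetchTimeUrlFilterSketch(List<Pattern> denyRules, boolean filterEnabled) {
        this.denyRules = denyRules;
        this.filterEnabled = filterEnabled;
    }

    /** Returns the URL if it may be fetched, or null if a deny rule matches. */
    public String filter(String url) {
        if (!filterEnabled) {
            return url;
        }
        for (Pattern rule : denyRules) {
            if (rule.matcher(url).find()) {
                return null;
            }
        }
        return url;
    }

    /** Applies the current rules to a fetch list that was generated earlier. */
    public List<String> filterFetchList(List<String> fetchList) {
        List<String> kept = new ArrayList<>();
        for (String url : fetchList) {
            if (filter(url) != null) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A segment created before the webadmin asked to stop crawling one domain.
        List<String> segment = List.of(
            "https://example.org/a",
            "https://blocked.example.com/page",
            "https://example.org/b");

        // Rule added after segment generation (hypothetical domain).
        List<Pattern> rules = List.of(
            Pattern.compile("^https?://([a-z0-9-]+\\.)*blocked\\.example\\.com/"));

        FetchTimeUrlFilterSketch fetcher = new FetchTimeUrlFilterSketch(rules, true);
        System.out.println(fetcher.filterFetchList(segment));
    }
}
```

      Because the rules are checked per fetch list entry rather than per CrawlDb record, the cost scales with the size of the segment, which is the trade-off the description points to.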

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 3
