Nutch / NUTCH-612

URL filtering is always disabled in Generator when invoked by Crawl

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: generator
    • Labels: None
    • Patch Info: Patch Available

    Description

      When a crawl is run with the 'bin/nutch crawl' command, Generator performs no URL filtering, even if 'crawl.generate.filter' is set to true in the configuration file.

      The problem is that the Generator's generate method unconditionally sets the job's filter value to whatever is passed to it:

      job.setBoolean(CRAWL_GENERATE_FILTER, filter);

      Crawl.java always passes false for this argument, so the value from the configuration file is overwritten.

      This has been fixed by exposing an overloaded generate method that takes only the 5 arguments Crawl needs to set. The overload reads 'crawl.generate.filter' from the configuration and sets the filter value accordingly.
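
The behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not the code from the attached patch: the class name GeneratorFilterSketch is hypothetical, a plain Map stands in for the job configuration, the other generate arguments are omitted, and the default of true used when the property is unset is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Generator; a Map replaces the real job configuration.
class GeneratorFilterSketch {
    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";

    // Mirrors the reported behaviour: the explicit argument always
    // overwrites whatever the configuration file said.
    static boolean generate(Map<String, String> conf, boolean filter) {
        Map<String, String> job = new HashMap<>(conf);
        // Equivalent of job.setBoolean(CRAWL_GENERATE_FILTER, filter)
        job.put(CRAWL_GENERATE_FILTER, Boolean.toString(filter));
        return Boolean.parseBoolean(job.get(CRAWL_GENERATE_FILTER));
    }

    // Sketch of the fix: an overload without the filter argument that
    // reads the configured value (assumed default: true) and delegates.
    static boolean generate(Map<String, String> conf) {
        boolean filter = Boolean.parseBoolean(
                conf.getOrDefault(CRAWL_GENERATE_FILTER, "true"));
        return generate(conf, filter);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CRAWL_GENERATE_FILTER, "true");

        // What Crawl did before the fix: filtering ends up disabled.
        System.out.println("old call, filtering enabled: " + generate(conf, false));
        // What Crawl can do after the fix: the configured value is honoured.
        System.out.println("new call, filtering enabled: " + generate(conf));
    }
}
```

With the property set to true, the first call reports filtering as disabled and the second as enabled, which is the difference between the old and new call sites in Crawl.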

      Attachments

        1. NUTCH-612v0.1.patch
          2 kB
          Susam Pal

          People

            Assignee: Andrzej Bialecki (ab)
            Reporter: Susam Pal (susam)