Nutch / NUTCH-612

URL filtering is always disabled in Generator when invoked by Crawl

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: generator
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      When a crawl is done using the 'bin/nutch crawl' command, no filtering is done in Generator even if 'crawl.generate.filter' is set to true in the configuration file.

      The problem is in the Generator's generate method, where the following line unconditionally overwrites the job's filter setting with whatever value is passed in as the filter argument:

      job.setBoolean(CRAWL_GENERATE_FILTER, filter);

      Crawl.java always passes false for this argument, so the value of 'crawl.generate.filter' from the configuration file is never honored.

      This has been fixed by exposing an overloaded generate method which takes only the 5 arguments that Crawl needs to set. This overloaded method reads the configuration and sets the filter value appropriately.
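
      The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the actual patch: the class name GenerateFilterSketch, the use of a plain Map in place of the job configuration, the boolean return value, and the default of true when the key is absent are all assumptions made for illustration. Only the key 'crawl.generate.filter' and the overwrite-then-overload pattern come from this report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Generator logic; a Map plays the role of the job config.
final class GenerateFilterSketch {
    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";

    // Shape of the original method: the caller-supplied flag always
    // overwrites whatever the configuration said.
    // Returns the filter value the job would actually run with.
    static boolean generate(Map<String, String> conf, boolean filter) {
        Map<String, String> job = new HashMap<>(conf);
        job.put(CRAWL_GENERATE_FILTER, Boolean.toString(filter));
        return Boolean.parseBoolean(job.get(CRAWL_GENERATE_FILTER));
    }

    // Shape of the fix: an overload without the filter argument reads the
    // configured value and delegates. The "true" default is an assumption.
    static boolean generate(Map<String, String> conf) {
        boolean filter =
            Boolean.parseBoolean(conf.getOrDefault(CRAWL_GENERATE_FILTER, "true"));
        return generate(conf, filter);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CRAWL_GENERATE_FILTER, "true");

        // What Crawl effectively did before the fix: hard-coded false.
        System.out.println("explicit false: " + generate(conf, false));
        // What Crawl can do after the fix: let the configuration decide.
        System.out.println("from config:    " + generate(conf));
    }
}
```

      In this sketch, a caller that passes false explicitly disables filtering regardless of the configuration, while a caller that uses the overload gets the configured value.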

            People

            • Assignee:
              ab Andrzej Bialecki
            • Reporter:
              susam Susam Pal