  1. Nutch
  2. NUTCH-612

URL filtering is always disabled in Generator when invoked by Crawl

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: generator
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      When a crawl is done using the 'bin/nutch crawl' command, no filtering is done in Generator even if 'crawl.generate.filter' is set to true in the configuration file.

      The problem is that in the Generator's generate method, the following code unconditionally sets the filter value of the job to whatever is passed to it:

      job.setBoolean(CRAWL_GENERATE_FILTER, filter);

      Crawl.java always passes false for this argument, so the value configured for 'crawl.generate.filter' is overwritten.

      The fix exposes an overloaded generate method that takes only the five arguments Crawl needs to set. This overload reads the configuration and sets the filter value from it.
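
      The behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use the real Nutch or Hadoop classes: a plain Map stands in for the job configuration, the class name GeneratorFilterSketch and the helper getBoolean are hypothetical, and the other generate arguments are omitted. The default of true for a missing property is also an assumption made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class GeneratorFilterSketch {
    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";

    // Hypothetical helper: reads a boolean property from the stand-in configuration.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String value = conf.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }

    // Mirrors the reported behaviour: the job's filter value is set
    // unconditionally to the argument, ignoring what the configuration says.
    // Returns the filter value the job would end up with.
    static boolean generate(Map<String, String> conf, boolean filter) {
        Map<String, String> job = new HashMap<>(conf);
        job.put(CRAWL_GENERATE_FILTER, String.valueOf(filter));
        return getBoolean(job, CRAWL_GENERATE_FILTER, true);
    }

    // Sketch of the fix: an overload without the filter argument that reads
    // the configured value (assumed default: true) and passes it on.
    static boolean generate(Map<String, String> conf) {
        boolean filter = getBoolean(conf, CRAWL_GENERATE_FILTER, true);
        return generate(conf, filter);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CRAWL_GENERATE_FILTER, "true");

        // A caller that always passes false overrides the configuration.
        System.out.println("explicit false: " + generate(conf, false));
        // A caller using the overload gets the configured value.
        System.out.println("overload:       " + generate(conf));
    }
}
```

      With 'crawl.generate.filter' set to true, the first call leaves filtering off while the second keeps it on, which is the difference the patch is meant to make for the crawl command.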

        Activity

        susam Susam Pal added a comment -

        Attached patch to fix the bug. This modifies Crawl.java and Generator.java.

        ab Andrzej Bialecki added a comment -

        Patch committed to trunk rev. 637114. Thank you!

        hudson Hudson added a comment -

        Integrated in Nutch-trunk #390 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/390/ )

          People

          • Assignee:
            ab Andrzej Bialecki
            Reporter:
            susam Susam Pal
          • Votes:
            0
            Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved: