Nutch / NUTCH-2536

GeneratorReducer.count is a static variable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4
    • Component/s: generator
    • Labels:
    • Environment:

      Non-distributed, single-node, standalone Nutch 2.3.1 jobs run in a single JVM with HBase as the data store.

      Description

      The count field of the GeneratorReducer class is a static field. A static field is shared by every instance of the class in the JVM, so if the GeneratorJob is run multiple times within the same JVM, the count carries over from run to run and ends up counting the web pages generated across all batches.

      The count field is checked against the GeneratorJob's topN configuration variable, which is described as:

      "top threshold for maximum number of URLs permitted in a batch"

      I understand this to mean that EACH batch should be capped at the topN value, not ALL batches combined.
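A minimal sketch of the behavior described above, assuming a simplified reducer. `SketchReducer`, `offer`, and `runBatch` are hypothetical names used only for illustration; this is not the actual Nutch `GeneratorReducer` code, only a stand-in with a static counter checked against a topN limit.

```java
public class StaticCountDemo {

    /** Hypothetical stand-in for a reducer with a static counter (not the Nutch class). */
    static class SketchReducer {
        // Static: shared by every instance created in this JVM and never reset.
        static long count = 0;
        private final long topN;

        SketchReducer(long topN) {
            this.topN = topN;
        }

        /** Returns true if the URL is accepted, false once the cap is reached. */
        boolean offer(String url) {
            if (count >= topN) {
                return false;
            }
            count++;
            return true;
        }
    }

    /** Simulates one generate run: a fresh reducer is offered `candidates` URLs. */
    static int runBatch(long topN, int candidates) {
        SketchReducer reducer = new SketchReducer(topN);
        int accepted = 0;
        for (int i = 0; i < candidates; i++) {
            if (reducer.offer("http://example.com/" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Two runs in the same JVM, each with topN = 3 and 5 candidate URLs.
        System.out.println("first run accepted:  " + runBatch(3, 5));
        System.out.println("second run accepted: " + runBatch(3, 5));
    }
}
```

In this sketch the second run accepts nothing, because the first run already pushed the shared counter up to the cap.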

      This isn't a problem with the way that Nutch is typically used, because the script starts a new JVM each time. I'm not using the script; I'm calling the Java classes directly (using the ToolRunner) within an existing JVM, so I'm categorizing this as an SDK issue.

      Changing the field to be non-static will not affect the behavior of the class when it is run by the script.
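The proposed change can be illustrated with a small sketch in which the counter is an instance field. `PerRunReducer`, `offer`, and `runBatch` are hypothetical names, and the sketch assumes that each run constructs its own reducer instance; it is not the actual patch applied to Nutch.

```java
public class InstanceCountDemo {

    /** Hypothetical reducer whose counter belongs to the instance (not the Nutch class). */
    static class PerRunReducer {
        // Non-static: each new reducer instance starts counting from zero.
        private long count = 0;
        private final long topN;

        PerRunReducer(long topN) {
            this.topN = topN;
        }

        /** Returns true if the URL is accepted, false once this instance hits the cap. */
        boolean offer(String url) {
            if (count >= topN) {
                return false;
            }
            count++;
            return true;
        }
    }

    /** Simulates one generate run with its own reducer instance. */
    static int runBatch(long topN, int candidates) {
        PerRunReducer reducer = new PerRunReducer(topN);
        int accepted = 0;
        for (int i = 0; i < candidates; i++) {
            if (reducer.offer("http://example.com/" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Two runs in the same JVM; each is capped at topN = 3 independently.
        System.out.println("first run accepted:  " + runBatch(3, 5));
        System.out.println("second run accepted: " + runBatch(3, 5));
    }
}
```

A run in a fresh JVM behaves the same with either kind of field, since the counter starts at zero in both cases, which is consistent with the script-driven workflow being unaffected.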

              People

              • Assignee: Unassigned
              • Reporter: Ben Vachon (bvachon)
              • Votes: 0
              • Watchers: 3

                  Time Tracking

                  • Original Estimate: 2.4h
                  • Remaining Estimate: 2.4h
                  • Time Spent: Not Specified