  Nutch / NUTCH-2328

GeneratorJob does not generate anything on second run

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: 2.2, 2.3, 2.2.1, 2.3.1
    • Fix Version/s: 2.5
    • Component/s: generator
    • Environment:

      Ubuntu 16.04 / Hadoop 2.7.1

    • Patch Info:
      Patch Available
    • Flags:
      Patch, Important

      Description

      Given a topN parameter (e.g. 10), GeneratorJob fails to generate anything new on subsequent runs within the same process space (JVM).
      To reproduce the issue, submit GeneratorJob to the M/R framework twice, one run after the other. The second run reports that it generated 0 URLs.
      The cause is the static count field (org.apache.nutch.crawl.GeneratorReducer#count), which is used to determine whether the topN value has been reached. Because the field is static, it is not reset between runs, so the second run starts with the limit already reached.
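
      The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the Nutch source: TopNCounterSketch, StaticCountReducer, InstanceCountReducer and candidateUrls are hypothetical names invented for this example, and the reduce logic is a simplified stand-in for a reducer that stops emitting once a topN limit is reached. It assumes both "runs" happen in the same JVM, as in the reproduction steps.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these classes are not part of Nutch.
public class TopNCounterSketch {

    /** Mimics the reported problem: the counter is static, so it outlives a single run. */
    public static class StaticCountReducer {
        public static long count = 0; // shared by every instance in the JVM
        private final long limit;

        public StaticCountReducer(long limit) {
            this.limit = limit;
        }

        /** Returns how many URLs were emitted before the limit was hit. */
        public int reduce(List<String> urls) {
            int emitted = 0;
            for (String url : urls) {
                if (count >= limit) {
                    break; // limit already reached, possibly by an earlier run
                }
                count++;
                emitted++;
            }
            return emitted;
        }
    }

    /** One possible remedy (an assumption, not the attached patch): keep the counter per instance. */
    public static class InstanceCountReducer {
        private long count = 0; // starts at zero for every new reducer
        private final long limit;

        public InstanceCountReducer(long limit) {
            this.limit = limit;
        }

        public int reduce(List<String> urls) {
            int emitted = 0;
            for (String url : urls) {
                if (count >= limit) {
                    break;
                }
                count++;
                emitted++;
            }
            return emitted;
        }
    }

    /** Builds n placeholder URLs to feed the reducers. */
    public static List<String> candidateUrls(int n) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            urls.add("http://example.com/page" + i);
        }
        return urls;
    }

    public static void main(String[] args) {
        long topN = 10;
        List<String> urls = candidateUrls(25);

        // Two consecutive "runs" in the same JVM, each with a fresh reducer object.
        int staticFirst = new StaticCountReducer(topN).reduce(urls);
        int staticSecond = new StaticCountReducer(topN).reduce(urls);
        int instanceFirst = new InstanceCountReducer(topN).reduce(urls);
        int instanceSecond = new InstanceCountReducer(topN).reduce(urls);

        System.out.println("static counter:   " + staticFirst + ", " + staticSecond);
        System.out.println("instance counter: " + instanceFirst + ", " + instanceSecond);
    }
}
```

      With the static counter the second run emits 0 URLs, because the count left over from the first run already equals topN. With a per-instance counter both runs emit 10. Resetting the static field at the start of each run would be another way to get the same effect.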

              People

              • Assignee:
                Unassigned
                Reporter:
                arthur-evozon Arthur B
              • Votes:
                1
                Watchers:
                4

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  24h
                  Remaining:
                  24h
                  Logged:
                  Not Specified