NUTCH-1074: topN is ignored with maxNumSegments

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: generator
    • Labels: None

      Description

      When generating segments with both topN and maxNumSegments, topN is not respected. It looks like the first generated segment contains topN * maxNumSegments URLs; at least, the number of map input records roughly matches that figure.
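
      A minimal sketch of the arithmetic behind the symptom (plain Java; the single-reducer assumption and all values are illustrative, not taken from the actual Generator code):

        public class TopNTimesSegmentsSketch {
          public static void main(String[] args) {
            long topN = 250_000L;     // hypothetical -topN value
            int maxNumSegments = 3;   // hypothetical -maxNumSegments value
            int numReduceTasks = 1;   // assume a single reducer for simplicity

            // The selector may emit up to (topN / numReduceTasks) records per segment,
            // for up to maxNumSegments segments, so the total selected is:
            long totalSelected = (topN / numReduceTasks) * maxNumSegments; // 750,000

            // If the later per-segment split does not enforce topN, all of those
            // records can end up in the first segment.
            System.out.println("total selected = " + totalSelected);
            System.out.println("worst-case size of segment 1 = " + totalSelected
                + " (expected at most " + topN + ")");
          }
        }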

          Activity

          Markus Jelsma added a comment (edited)

          Finally got some numbers to share from a running test:

          maxNumSegments = 3
          topN = 250,000
          Selector reduce output records = 750,000
          The above looks fine: the generator selects exactly numSegments * topN records to be consumed by the following numSegments partitioners. Here are the reducer output record counts for the three partitioned segments, in order:

          1: 471,428
          2: 171,562
          3: 107,010

          The strange thing is that the number of reduce output records exactly matches the total number of map input records, which is not what I had expected. The generator partitions by host and has a limit on the number of hosts per queue, so I would expect each segment to contain slightly fewer records than topN, and certainly not more than topN.

          What exactly should we expect?
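
          For reference, a quick cross-check in plain Java (the variable names are mine) that the three observed segment sizes add up to exactly numSegments * topN, i.e. the selector's total is respected even though the per-segment split is not:

            public class SegmentSizeCheck {
              public static void main(String[] args) {
                long topN = 250_000L;
                int maxNumSegments = 3;
                long[] observed = { 471_428L, 171_562L, 107_010L }; // reducer output records per segment

                long sum = 0;
                for (long n : observed) sum += n;

                System.out.println("sum of segment sizes  = " + sum);                    // 750000
                System.out.println("numSegments * topN    = " + topN * maxNumSegments);  // 750000
                System.out.println("segment 1 exceeds topN by " + (observed[0] - topN)); // 221428
              }
            }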

          Robert Thomson added a comment (edited)

          When generate.max.count is set, the Generator.Selector reduce function partitions records so that each segment contains up to that number of entries per host. The relative sizes of the resulting segments then depend on the distribution of hosts in the crawldb; topN only limits the mean segment size.

          If generate.max.count is not set, each segment will contain up to topN records.

          Anyway, here's my fix. When using generate.max.count, each segment will contain up to topN records, with at most generate.max.count URLs from any single host.

          Index: src/java/org/apache/nutch/crawl/Generator.java
          ===================================================================
          --- src/java/org/apache/nutch/crawl/Generator.java      (revision 1172165)
          +++ src/java/org/apache/nutch/crawl/Generator.java      (working copy)
          @@ -115,6 +115,7 @@
               private long limit;
               private long count;
               private HashMap<String,int[]> hostCounts = new HashMap<String,int[]>();
          +    private int segCounts[];
               private int maxCount;
               private boolean byDomain = false;
               private Partitioner<Text,Writable> partitioner = new URLPartitioner();
          @@ -155,6 +156,7 @@
                 schedule = FetchScheduleFactory.getFetchSchedule(job);
                 scoreThreshold = job.getFloat(GENERATOR_MIN_SCORE, Float.NaN);
                 maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
          +      segCounts = new int[maxNumSegments];
               }
           
               public void close() {}
          @@ -269,6 +271,12 @@
                     // increment hostCount
                     hostCount[1]++;
           
          +          // check if topN reached, select next segment if it is
          +          while (segCounts[hostCount[0]-1] >= limit && hostCount[0] < maxNumSegments) {
          +            hostCount[0]++;
          +            hostCount[1] = 0;
          +          }
          +
                     // reached the limit of allowed URLs per host / domain
                     // see if we can put it in the next segment?
                     if (hostCount[1] > maxCount) {
          @@ -285,7 +293,11 @@
                       }
                     }
                     entry.segnum = new IntWritable(hostCount[0]);
          -        } else entry.segnum = new IntWritable(currentsegmentnum);
          +          segCounts[hostCount[0]-1]++;
          +        } else {
          +          entry.segnum = new IntWritable(currentsegmentnum);
          +          segCounts[currentsegmentnum-1]++;
          +        }
           
                   output.collect(key, entry);
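
          To make the patched control flow easier to follow outside MapReduce, here is a stripped-down, self-contained sketch of the segment-assignment logic the diff introduces (the field names mirror the diff, but this is an illustration, not the production reducer): a URL is first moved to the next segment whenever the current one has reached its topN-derived limit (the new segCounts check), and is then moved again, or dropped, if its host has already exceeded generate.max.count for that segment.

            import java.util.HashMap;
            import java.util.Map;

            /** Illustrative sketch of the patched bookkeeping; not the actual Generator.Selector. */
            public class SegmentAssignmentSketch {
              private final long limit;           // per-segment cap derived from topN
              private final int maxCount;         // generate.max.count: max URLs per host per segment
              private final int maxNumSegments;
              private final int[] segCounts;      // new in the patch: URLs already placed per segment
              private final Map<String, int[]> hostCounts = new HashMap<>(); // host -> {segment, count}

              SegmentAssignmentSketch(long limit, int maxCount, int maxNumSegments) {
                this.limit = limit;
                this.maxCount = maxCount;
                this.maxNumSegments = maxNumSegments;
                this.segCounts = new int[maxNumSegments];
              }

              /** Returns the 1-based segment for the next URL of this host, or -1 if it is dropped. */
              int assign(String host) {
                int[] hostCount = hostCounts.computeIfAbsent(host, h -> new int[] { 1, 0 });
                hostCount[1]++;

                // New in the patch: if the current segment already holds 'limit' URLs, move on.
                while (segCounts[hostCount[0] - 1] >= limit && hostCount[0] < maxNumSegments) {
                  hostCount[0]++;
                  hostCount[1] = 0;
                }

                // Pre-existing behaviour: too many URLs from this host in the current segment.
                if (hostCount[1] > maxCount) {
                  if (hostCount[0] < maxNumSegments) {
                    hostCount[0]++;
                    hostCount[1] = 0;
                  } else {
                    return -1; // the real reducer skips the entry at this point
                  }
                }

                segCounts[hostCount[0] - 1]++; // new in the patch: count the URL against its segment
                return hostCount[0];
              }

              public static void main(String[] args) {
                // Tiny illustrative limits: 4 URLs per segment, 2 per host per segment, 3 segments.
                SegmentAssignmentSketch sketch = new SegmentAssignmentSketch(4, 2, 3);
                for (String host : new String[] { "a", "a", "a", "b", "b", "b", "c", "c", "d" }) {
                  System.out.println(host + " -> segment " + sketch.assign(host));
                }
              }
            }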
          
          Markus Jelsma added a comment

          Yes! I overlooked generate.max.count, and you're right. Could you attach your patch to the issue, flagged for approval of inclusion in Nutch, so we can test it further and include it if all goes well?

          Robert Thomson added a comment

          Patch to make generate.max.count and topN work together

          Markus Jelsma added a comment

          Committed for 1.4 in rev 1174689. Thanks, Robert, for contributing the patch.

          Hudson added a comment

          Integrated in Nutch-branch-1.4 #15 (See https://builds.apache.org/job/Nutch-branch-1.4/15/)
          NUTCH-1074 topN is ignored with maxNumSegments and generate.max.count

          markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174689
          Files:

          • /nutch/branches/branch-1.4/CHANGES.txt
          • /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Generator.java
          Markus Jelsma added a comment

          Bulk close of resolved issues for 1.4. bulkclose-1.4-20111220


            People

            • Assignee: Markus Jelsma
            • Reporter: Markus Jelsma