NUTCH-498: Use Combiner in LinkDb to increase speed of linkdb generation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: linkdb
    • Labels: None

      Description

      I tried to add the following combiner to LinkDb:

      public static enum Counters { COMBINED }

      public static class LinkDbCombiner extends MapReduceBase implements Reducer {
        private int _maxInlinks;

        @Override
        public void configure(JobConf job) {
          super.configure(job);
          _maxInlinks = job.getInt("db.max.inlinks", 10000);
        }

        public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException {
          // Fold all Inlinks values for this key into the first one.
          final Inlinks inlinks = (Inlinks) values.next();
          int combined = 0;
          while (values.hasNext()) {
            Inlinks val = (Inlinks) values.next();
            for (Iterator it = val.iterator(); it.hasNext();) {
              if (inlinks.size() >= _maxInlinks) {
                if (combined > 0) {
                  reporter.incrCounter(Counters.COMBINED, combined);
                }
                output.collect(key, inlinks);
                return;
              }
              Inlink in = (Inlink) it.next();
              inlinks.add(in);
            }
            combined++;
          }
          if (inlinks.size() == 0) {
            return;
          }
          if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
          }
          output.collect(key, inlinks);
        }
      }

      This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.

      Map output records:       8,717,810,541
      Combined:                 7,632,541,507
      Resulting output records: 1,085,269,034

      That's an 87% reduction of output records from the map phase ((8,717,810,541 - 1,085,269,034) / 8,717,810,541 ≈ 87.6%).
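
      (For illustration, a sketch of how such a combiner would be hooked into the LinkDb job; the JobConf setup shown here is assumed for the example, not copied from the attached patch:)

          // Registering the combiner when configuring the LinkDb job.
          // "config" stands for the Nutch Configuration in scope (assumed name).
          JobConf job = new JobConf(config, LinkDb.class);
          job.setReducerClass(LinkDb.class);
          job.setCombinerClass(LinkDbCombiner.class); // pre-merge Inlinks on the map side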

      1. LinkDbCombiner.patch
        0.5 kB
        Espen Amble Kolstad
      2. LinkDbCombiner.patch
        1 kB
        Espen Amble Kolstad

        Activity

        Enis Soztutar added a comment -

        I think you may not want

            reporter.incrCounter(Counters.COMBINED, combined);

        which increments the counter by the total count so far, but rather you may use

            reporter.incrCounter(Counters.COMBINED, 1);

        for each URL combined.

        Could you attach the patch against current trunk, so that we can apply it directly?
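
        (For illustration, the per-record variant described above would look roughly like this: the same reduce() as the combiner in the description, with the counter bumped once per inlink folded in. A sketch against the code above, not the attached patch:)

            public void reduce(WritableComparable key, Iterator values,
                OutputCollector output, Reporter reporter) throws IOException {
              final Inlinks inlinks = (Inlinks) values.next();
              while (values.hasNext()) {
                Inlinks val = (Inlinks) values.next();
                for (Iterator it = val.iterator(); it.hasNext();) {
                  if (inlinks.size() >= _maxInlinks) {
                    output.collect(key, inlinks);
                    return;
                  }
                  inlinks.add((Inlink) it.next());
                  // One increment per combined inlink, instead of a running total.
                  reporter.incrCounter(Counters.COMBINED, 1);
                }
              }
              if (inlinks.size() > 0) {
                output.collect(key, inlinks);
              }
            }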

        Espen Amble Kolstad added a comment -

        Here's a patch for trunk.

        I removed the Counter since it's not really useful information; it only served to show the reduction of output records.

        Doğacan Güney added a comment -

        Why can't we just set the combiner class to LinkDb? AFAICS, you are not doing anything different from LinkDb.reduce in LinkDbCombiner.reduce. A one-liner

        job.setCombinerClass(LinkDb.class);

        should do the trick, shouldn't it?

        Espen Amble Kolstad added a comment -

        Yes, you're right

        I forgot I added a new class just to get the Counter ...

        Espen Amble Kolstad added a comment -

        Made a patch for the one-liner mentioned above.

        Doğacan Güney added a comment -

        After examining the code better, I am a bit confused. We have a LinkDb.Merger.reduce and LinkDb.reduce. They both do the same thing (aggregate inlinks until their size reaches maxInlinks, then collect). Why do we have them separately? Is there a difference between them that I am missing?

        Andrzej Bialecki added a comment -

        Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.
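
        (For illustration, the "separate instance" approach looks roughly like this; a simplified sketch of the idea, not the actual trunk code, with maxInlinks standing for the configured db.max.inlinks value:)

            public void reduce(WritableComparable key, Iterator values,
                OutputCollector output, Reporter reporter) throws IOException {
              // Accumulate into a fresh Inlinks rather than mutating the first
              // value returned by the iterator, which the framework may reuse.
              Inlinks result = new Inlinks();
              while (values.hasNext() && result.size() < maxInlinks) {
                Inlinks inlinks = (Inlinks) values.next();
                for (Iterator it = inlinks.iterator();
                     it.hasNext() && result.size() < maxInlinks;) {
                  result.add((Inlink) it.next());
                }
              }
              if (result.size() > 0) {
                output.collect(key, result);
              }
            }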

        Doğacan Güney added a comment -

        > Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could
        > replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.

        Sounds good. I opened NUTCH-499 for this.

        Doğacan Güney added a comment -

        I tested creating a linkdb from ~6M urls:

        Combine input records:  42,091,902
        Combine output records: 15,684,838

        (The combiner reduces the number of records to around 1/3.)

        The job took ~15 minutes overall with the combiner, ~20 minutes without.

        So, +1 from me.

        Andrzej Bialecki added a comment -

        +1.

        Sami Siren added a comment -

        +1

        Doğacan Güney added a comment -

        Committed in rev. 551147.

        Doğacan Güney added a comment -

        Issue resolved and committed.

        Hudson added a comment -

        Integrated in Nutch-Nightly #131 (See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/)

          People

          • Assignee: Doğacan Güney
          • Reporter: Espen Amble Kolstad
          • Votes: 0
          • Watchers: 0
