Nutch / NUTCH-498

Use Combiner in LinkDb to increase speed of linkdb generation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: linkdb
    • Labels:
      None

      Description

      I tried to add the following combiner to LinkDb:

      {code}
      public static enum Counters { COMBINED }

      public static class LinkDbCombiner extends MapReduceBase implements Reducer {
         private int _maxInlinks;

         @Override
         public void configure(JobConf job) {
            super.configure(job);
            _maxInlinks = job.getInt("db.max.inlinks", 10000);
         }

         public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException {
            // Fold all remaining Inlinks values for this key into the first one.
            final Inlinks inlinks = (Inlinks) values.next();
            int combined = 0;
            while (values.hasNext()) {
               Inlinks val = (Inlinks) values.next();
               for (Iterator it = val.iterator(); it.hasNext();) {
                  // Stop merging once the db.max.inlinks cap is reached.
                  if (inlinks.size() >= _maxInlinks) {
                     if (combined > 0) {
                        reporter.incrCounter(Counters.COMBINED, combined);
                     }
                     output.collect(key, inlinks);
                     return;
                  }
                  Inlink in = (Inlink) it.next();
                  inlinks.add(in);
               }
               combined++;
            }
            // Nothing to emit if the merged set is empty.
            if (inlinks.size() == 0) {
               return;
            }
            if (combined > 0) {
               reporter.incrCounter(Counters.COMBINED, combined);
            }
            output.collect(key, inlinks);
         }
      }
      {code}

      This greatly reduced the time it took to generate a new linkdb; in my case it cut the time in half.

      |Map output records|8,717,810,541|
      |Combined|7,632,541,507|
      |Resulting output records|1,085,269,034|

      That's an 87% reduction of output records from the map phase (1,085,269,034 of 8,717,810,541 records survive, about 12.4%).
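      For reference, a minimal sketch (hypothetical, not part of the attached patch) of how such a combiner could be registered on the LinkDb job, using the old Hadoop API the class above targets. Hadoop may apply a combiner zero or more times per key, which this merge tolerates because the reducer applies the same aggregation again:

      {code}
      // Hypothetical wiring; LinkDb's actual job setup may differ.
      JobConf job = new JobConf(NutchConfiguration.create());
      job.setJobName("linkdb example");
      // Merge Inlinks map-side, before the shuffle, to cut record volume.
      job.setCombinerClass(LinkDbCombiner.class);
      {code}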

      1. LinkDbCombiner.patch
        1 kB
        Espen Amble Kolstad
      2. LinkDbCombiner.patch
        0.5 kB
        Espen Amble Kolstad

        Activity

        Espen Amble Kolstad created issue -
        Enis Soztutar added a comment -

        I think you may not want

        {code}
        reporter.incrCounter(Counters.COMBINED, combined);
        {code}

        which increments the counter by the accumulated total so far; rather, you may use

        {code}
        reporter.incrCounter(Counters.COMBINED, 1);
        {code}

        for each url combined.

        Could you attach the patch against current trunk, so that we can apply it directly?

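        One reading of that suggestion, as a hypothetical sketch (adapted from the combiner in the description, not from any attached patch), moves the counter update into the merge loop:

        {code}
        while (values.hasNext()) {
           Inlinks val = (Inlinks) values.next();
           for (Iterator it = val.iterator(); it.hasNext();) {
              if (inlinks.size() >= _maxInlinks) {
                 output.collect(key, inlinks);
                 return;
              }
              inlinks.add((Inlink) it.next());
           }
           // One increment per combined value, rather than one bulk update at the end.
           reporter.incrCounter(Counters.COMBINED, 1);
        }
        {code}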
        Espen Amble Kolstad added a comment -

        Here's a patch for trunk.

        I removed the Counter since it's not really useful information; it only showed the reduction in output records.

        Espen Amble Kolstad made changes -
        Attachment LinkDbCombiner.patch [ 12359824 ]
        Doğacan Güney added a comment -

        Why can't we just set the combiner class to LinkDb? AFAICS, you are not doing anything different from LinkDb.reduce in LinkDbCombiner.reduce. A one-liner

        {code}
        job.setCombinerClass(LinkDb.class);
        {code}

        should do the trick, shouldn't it?

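        For context, a hypothetical sketch of where that one-liner would sit in the job setup (not the actual LinkDb code). A combiner's input and output types must match the map output types, which is what lets LinkDb.reduce double as the combine step:

        {code}
        // Hypothetical job configuration, old Hadoop API.
        JobConf job = new JobConf(NutchConfiguration.create());
        job.setMapperClass(LinkDb.class);
        job.setCombinerClass(LinkDb.class); // the proposed one-liner
        job.setReducerClass(LinkDb.class);
        job.setOutputKeyClass(Text.class);      // map output key = combiner key
        job.setOutputValueClass(Inlinks.class); // map output value = combiner value
        {code}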
        Espen Amble Kolstad added a comment -

        Yes, you're right.

        I forgot I added a new class just to get the Counter ...

        Espen Amble Kolstad added a comment -

        Made a patch for the one-liner mentioned above.

        Espen Amble Kolstad made changes -
        Attachment LinkDbCombiner.patch [ 12359872 ]
        Doğacan Güney added a comment -

        After examining the code more closely, I am a bit confused. We have LinkDb.Merger.reduce and LinkDb.reduce. They both do the same thing (aggregate inlinks until their size reaches maxInlinks, then collect). Why do we have them separately? Is there a difference between them that I am missing?

        Andrzej Bialecki added a comment -

        Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.

        Doğacan Güney added a comment -

        > Currently there is no difference, indeed. The version in LinkDb.reduce is safer, because it uses a separate instance of Inlinks. Perhaps we could
        > replace LinkDb.Merger.reduce with the body of LinkDb.reduce, and completely remove LinkDb.reduce.

        Sounds good. I opened NUTCH-499 for this.

        Doğacan Güney added a comment -

        I tested creating a linkdb from ~6M urls:

        |Combine input records|42,091,902|
        |Combine output records|15,684,838|

        (The combiner reduces the number of records to around 1/3.)

        The job took ~15 minutes overall with the combiner, ~20 minutes without it.

        So, +1 from me.

        Andrzej Bialecki added a comment -

        +1.

        Sami Siren added a comment -

        +1

        Doğacan Güney added a comment -

        Committed in rev. 551147.

        Doğacan Güney made changes -
        Assignee Doğacan Güney [ dogacan ]
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Fix Version/s 1.0.0 [ 12312443 ]
        Doğacan Güney added a comment -

        Issue resolved and committed.

        Doğacan Güney made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hudson added a comment -

        Integrated in Nutch-Nightly #131 (See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/ )

          People

          • Assignee:
            Doğacan Güney
          • Reporter:
            Espen Amble Kolstad