Nutch / NUTCH-498

Use Combiner in LinkDb to increase speed of linkdb generation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: linkdb
    • Labels:
      None

      Description

      I tried adding the following combiner to LinkDb:

      public static enum Counters { COMBINED }

      public static class LinkDbCombiner extends MapReduceBase implements Reducer {
        private int _maxInlinks;

        @Override
        public void configure(JobConf job) {
          super.configure(job);
          _maxInlinks = job.getInt("db.max.inlinks", 10000);
        }

        public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException {
          // Merge every later value for this key into the first Inlinks instance.
          final Inlinks inlinks = (Inlinks) values.next();
          int combined = 0;
          while (values.hasNext()) {
            Inlinks val = (Inlinks) values.next();
            for (Iterator it = val.iterator(); it.hasNext(); ) {
              if (inlinks.size() >= _maxInlinks) {
                // Cap reached: emit what has been merged so far and stop.
                if (combined > 0) {
                  reporter.incrCounter(Counters.COMBINED, combined);
                }
                output.collect(key, inlinks);
                return;
              }
              Inlink in = (Inlink) it.next();
              inlinks.add(in);
            }
            // One more map output record was folded into the merged value.
            combined++;
          }
          if (inlinks.size() == 0) {
            return;
          }
          if (combined > 0) {
            reporter.incrCounter(Counters.COMBINED, combined);
          }
          output.collect(key, inlinks);
        }
      }
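      The merge-with-cap logic of the combiner can be tried out without a Hadoop cluster. The following is only a minimal sketch in plain Java, not the patch itself: the class name CombinerSketch, the Result holder, and the use of a list of strings as a stand-in for Inlinks are all assumptions made for illustration, and it leaves out the early return the combiner performs for an empty result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for LinkDbCombiner.reduce(); it has no Hadoop or Nutch dependency.
public class CombinerSketch {

    /** Merged inlinks for one key, plus the number of values fully folded into the first one. */
    public static final class Result {
        public final List<String> inlinks;
        public final int combined;

        Result(List<String> inlinks, int combined) {
            this.inlinks = inlinks;
            this.combined = combined;
        }
    }

    /** Merges all values for one key, stopping as soon as maxInlinks entries are held. */
    public static Result combine(Iterator<List<String>> values, int maxInlinks) {
        // The first value is taken as-is, as in the combiner above.
        List<String> merged = new ArrayList<>(values.next());
        int combined = 0;
        while (values.hasNext()) {
            for (String in : values.next()) {
                if (merged.size() >= maxInlinks) {
                    // Cap reached: return what has been merged so far.
                    return new Result(merged, combined);
                }
                merged.add(in);
            }
            combined++;
        }
        return new Result(merged, combined);
    }

    public static void main(String[] args) {
        List<List<String>> values = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"),
                Arrays.asList("d", "e"));
        Result r = combine(values.iterator(), 4);
        System.out.println(r.inlinks + " combined=" + r.combined);
    }
}
```

      With a cap of 4, the third value is only partly merged, so it is not counted as combined. This mirrors how the combiner above increments its counter only after a value has been consumed completely.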

      This greatly reduced the time it took to generate a new linkdb. In my case it reduced the time by half.

      Map output records         8717810541
      Combined                   7632541507
      Resulting output records   1085269034

      That is a reduction of roughly 87.6% in the number of output records from the map phase.
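      The counter values reported above can be checked with a few lines of arithmetic. This is just a sketch; the class and method names are made up for illustration.

```java
// Hypothetical helper that re-derives the figures from the job counters quoted above.
public class CombinerReduction {

    /** Records left after the combiner has folded `combined` records into others. */
    public static long remaining(long mapOutput, long combined) {
        return mapOutput - combined;
    }

    /** Share of map output records removed by the combiner, in percent. */
    public static double reductionPercent(long mapOutput, long combined) {
        return 100.0 * combined / mapOutput;
    }

    public static void main(String[] args) {
        long mapOutput = 8717810541L;
        long combined = 7632541507L;
        System.out.println("Resulting records: " + remaining(mapOutput, combined));
        System.out.printf("Reduction: %.1f%%%n", reductionPercent(mapOutput, combined));
    }
}
```

      Subtracting the combined count from the map output count gives exactly the 1085269034 resulting records listed above, and the ratio of the two is a little over 87.5%.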

      1. LinkDbCombiner.patch
        0.5 kB
        Espen Amble Kolstad
      2. LinkDbCombiner.patch
        1 kB
        Espen Amble Kolstad

          People

          • Assignee:
            Doğacan Güney
            Reporter:
            Espen Amble Kolstad
