Mahout
  1. Mahout
  2. MAHOUT-467

Change Iterable<Cooccurrence> in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Not a Problem
    • Affects Version/s: 0.4
    • Fix Version/s: None
    • Labels:
      None

      Description

      In Class AbstractDistributedVectorSimilarity

      protected int countElements(Iterator<?> iterator)
      { int count = 0;
      while (iterator.hasNext())

      { count++; iterator.next(); }

      return count;
      }

      The method countElements is used continually and is called continually ,but it has bad performance.

      If the iterator has million elements ,we have to iterate million times to just get the count of the iterator.

      this methods used in many pacles:
      1) DistributedCooccurrenceVectorSimilarity

      public class DistributedCooccurrenceVectorSimilarity extends AbstractDistributedVectorSimilarity {

      @Override
      protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
      double weightOfVectorB, int numberOfColumns)

      { return countElements(cooccurrences); }

      }

      one items may be liked by many people, we has system ,one items may be liked by hundred thousand persons,
      Here doComputeResult just returned the count of elements in cooccurrences,but It has to iterate for hundred thousand times.

      If we use List or Array type,we can get the result in one call. because it already sets the size of the Array or list when system constructs the List or Array.

      2) DistributedLoglikelihoodVectorSimilarity
      3) DistributedTanimotoCoefficientVectorSimilarity

      I have doing a test using DistributedCooccurrenceVectorSimilarity
      it used 4.5 hours to run RowSimilarityJob-CooccurrencesMapper-SimilarityReducer

        Issue Links

          Activity

          Hide
          Sebastian Schelter added a comment - - edited

          For the millions of cooccurrences to be modeled as a list or an array, we would have to simultaneously load them all into memory.
          We can't do this because then the scalability of the whole job would be limited by the amount of RAM available on the worker machines.

          IIRC Mahout's goal is that its distributed jobs run in linear time compared to the amount of input data and need a constant amount of memory.

          Show
          Sebastian Schelter added a comment - - edited For the millions of cooccurrences to be modeled as a list or an array, we would have to simultaneously load them all into memory. We can't do this because then the scalability of the whole job would be limited by the amount of RAM available on the worker machines. IIRC Mahout's goal is that its distributed jobs run in linear time compared to the amount of input data and need a constant amount of memory.
          Hide
          Han Hui Wen added a comment -

          millions of cooccurrences is just a assumption.

          even thousand of cooccurrences ,iterating just for count is low performance.

          Show
          Han Hui Wen added a comment - millions of cooccurrences is just a assumption. even thousand of cooccurrences ,iterating just for count is low performance.
          Hide
          Sebastian Schelter added a comment -

          There's no point of time in the algorithm before the invocation of the SimilarityReducer where all cooccurrences for a pair of rows are seen together, so to load them into memory there it would also be necessary to iterate over them.

          Show
          Sebastian Schelter added a comment - There's no point of time in the algorithm before the invocation of the SimilarityReducer where all cooccurrences for a pair of rows are seen together, so to load them into memory there it would also be necessary to iterate over them.
          Hide
          Ted Dunning added a comment -

          This does expose another problem, though, in that these counts should be combined in addition to being reduced. That means that the count++ should be something like count += value and there should be an upstream combiner somewhere.

          That will do much more good than using lists.

          Show
          Ted Dunning added a comment - This does expose another problem, though, in that these counts should be combined in addition to being reduced. That means that the count++ should be something like count += value and there should be an upstream combiner somewhere. That will do much more good than using lists.
          Hide
          Han Hui Wen added a comment -

          I have opened one issue for MAPREDUCE,here is their comment :

          https://issues.apache.org/jira/browse/MAPREDUCE-2007

          Show
          Han Hui Wen added a comment - I have opened one issue for MAPREDUCE,here is their comment : https://issues.apache.org/jira/browse/MAPREDUCE-2007
          Hide
          Ted Dunning added a comment -

          That comment from Owen is essentially the same as what I and others have said. If you need to count, use integers and combiners. Don't wait until the reducer.

          In any case, it isn't a MAHOUT bug that map-reduce inherently doesn't allow you to see all the data being reduced in one place.

          Anybody mind if I close this as NOT-A-BUG?

          Show
          Ted Dunning added a comment - That comment from Owen is essentially the same as what I and others have said. If you need to count, use integers and combiners. Don't wait until the reducer. In any case, it isn't a MAHOUT bug that map-reduce inherently doesn't allow you to see all the data being reduced in one place. Anybody mind if I close this as NOT-A-BUG?
          Hide
          Hudson added a comment -

          Integrated in Mahout-Quality #200 (See https://hudson.apache.org/hudson/job/Mahout-Quality/200/)
          MAHOUT-467: removed static modifiers on driver private methods that were not invoked statically anyway

          Show
          Hudson added a comment - Integrated in Mahout-Quality #200 (See https://hudson.apache.org/hudson/job/Mahout-Quality/200/ ) MAHOUT-467 : removed static modifiers on driver private methods that were not invoked statically anyway

            People

            • Assignee:
              Sean Owen
              Reporter:
              Han Hui Wen
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development