MAHOUT-467

Change Iterable<Cooccurrence> in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to a List or array to improve performance


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • 0.4
    • None
    • None

    Description

      In class AbstractDistributedVectorSimilarity:

      protected int countElements(Iterator<?> iterator) {
        int count = 0;
        while (iterator.hasNext()) {
          count++;
          iterator.next();
        }
        return count;
      }

      The method countElements is called repeatedly, but it performs poorly.

      If the iterator has a million elements, we have to iterate a million times just to get the element count.

      This method is used in several places:
      1) DistributedCooccurrenceVectorSimilarity

      public class DistributedCooccurrenceVectorSimilarity extends AbstractDistributedVectorSimilarity {

        @Override
        protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
            double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {
          return countElements(cooccurrences);
        }

      }

      One item may be liked by many people. In our system, a single item may be liked by hundreds of thousands of users.
      Here doComputeResult only returns the number of elements in cooccurrences, yet it has to iterate hundreds of thousands of times to do so.

      If we used a List or array type, we could get the result in one call, because the size is already known once the List or array has been constructed.
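
A minimal sketch of the idea, not the Mahout code: the class name CountSketch and its static countElements helper are hypothetical. It assumes the values have already been materialized in memory; in a Hadoop reducer the values usually arrive as a streamed Iterable, so getting a List there would presumably mean buffering them first.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, for illustration only.
final class CountSketch {

  // Returns the number of elements in the given Iterable.
  // If it is a Collection, size() answers in one call;
  // otherwise we fall back to walking the iterator once.
  static int countElements(Iterable<?> elements) {
    if (elements instanceof Collection) {
      return ((Collection<?>) elements).size();
    }
    int count = 0;
    for (Iterator<?> it = elements.iterator(); it.hasNext(); it.next()) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<Integer> materialized = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      materialized.add(i);
    }
    // A view that exposes only iterator(), like a streamed value set.
    Iterable<Integer> streamed = materialized::iterator;

    System.out.println(countElements(materialized)); // uses size()
    System.out.println(countElements(streamed));     // iterates every element
  }
}
```

Both calls return the same count; only the first avoids visiting every element.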

      2) DistributedLoglikelihoodVectorSimilarity
      3) DistributedTanimotoCoefficientVectorSimilarity

      I ran a test using DistributedCooccurrenceVectorSimilarity.
      It took 4.5 hours to run RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.


            People

              srowen Sean R. Owen
              huiwenhan Han Hui Wen