  1. Mahout
  2. MAHOUT-468

Performance of RowSimilarityJob is not good

Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • 0.4
    • None
    • None

    Description

      I ran a test with the following data:

      Preference records: 680,194
      Distinct users: 23,246
      Distinct items: 437,569
      SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE

      maybePruneItemUserMatrixPath: 16.50M
      weights: 13.80M
      pairwiseSimilarity: 18.81G
      Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 32 sec
      Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours

      I think the reasons may be the following:
      1) We use SequenceFileOutputFormat, which causes the job to run with at most n (n = number of Hadoop nodes) mappers or reducers concurrently.
      2) We store redundant information.

      For example, the output of CooccurrencesMapper is:

      (ItemIndexA, similarity), (ItemIndexA, ItemIndexB, similarity)

      3) Some frequently used code is involved; see
      https://issues.apache.org/jira/browse/MAHOUT-467

      4) Many local variables are allocated inside loops (needs confirmation).

      In class DistributedUncenteredZeroAssumingCosineVectorSimilarity:

      @Override
      public double weight(Vector v) {
        double length = 0.0;
        Iterator<Element> elemIterator = v.iterateNonZero();
        while (elemIterator.hasNext()) {
          double value = elemIterator.next().get(); // this one
          length += value * value;
        }
        return Math.sqrt(length);
      }
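
To help check point 4, here is a minimal standalone sketch of the same arithmetic. It assumes the non-zero values are available as a plain double[] and does not use Mahout's Vector or Element classes. The class name SparseNormSketch is hypothetical. It shows that declaring a primitive local inside the loop body does not by itself allocate anything on the heap. Whether the iterator in the real code allocates objects is a separate question that this sketch does not answer.

```java
/** Hypothetical stand-in for the weight computation; not part of Mahout. */
final class SparseNormSketch {

    /** Euclidean norm of the given non-zero values: sqrt(sum of squares). */
    static double weight(double[] nonZeroValues) {
        double length = 0.0;
        for (int i = 0; i < nonZeroValues.length; i++) {
            // A primitive local declared in the loop body lives in a stack
            // slot or register; the declaration allocates nothing on the heap.
            double value = nonZeroValues[i];
            length += value * value;
        }
        return Math.sqrt(length);
    }

    public static void main(String[] args) {
        // sqrt(3*3 + 4*4) = sqrt(25)
        System.out.println(weight(new double[] {3.0, 4.0})); // prints 5.0
    }
}
```

If this reasoning carries over, the line marked "this one" is unlikely to be the bottleneck on its own. A profiler run on the actual job would be needed to confirm that.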

      5) Maybe we need to control the size of the cooccurrences.
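
A rough way to reason about point 5 is sketched below. It assumes that each column (user) with n entries contributes one cooccurrence per unordered pair of entries, that is n * (n - 1) / 2 pairs. The real CooccurrencesMapper may count differently. The class name CooccurrenceSizeSketch and the parameter maxEntriesPerColumn are hypothetical and are not Mahout options. The test data averages about 29 preferences per user (680,194 / 23,246), but because the pair count grows quadratically, a few very active users can dominate the total.

```java
/** Hypothetical estimate of cooccurrence volume; not part of Mahout. */
final class CooccurrenceSizeSketch {

    /** Number of unordered pairs among n entries. */
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Total pairs over all columns, with the entries per column capped. */
    static long totalPairs(int[] entriesPerColumn, int maxEntriesPerColumn) {
        long total = 0;
        for (int n : entriesPerColumn) {
            total += pairs(Math.min(n, maxEntriesPerColumn));
        }
        return total;
    }

    public static void main(String[] args) {
        // Three illustrative users with 2, 10 and 1000 preferences (made-up numbers).
        int[] counts = {2, 10, 1000};
        System.out.println(totalPairs(counts, Integer.MAX_VALUE)); // prints 499546
        System.out.println(totalPairs(counts, 100));               // prints 4996
    }
}
```

In this made-up example the single user with 1000 preferences accounts for almost all pairs, and capping that user at 100 entries cuts the total by roughly a factor of 100. How such a cap affects similarity quality is not addressed here.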

            People

              Assignee: Unassigned
              Reporter: Han Hui Wen (huiwenhan)