Mahout / MAHOUT-468

Performance of RowSimilarityJob is not good


Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Version: 0.4

    Description

      I ran a test with the following data:

      Preference records: 680,194
      Distinct users: 23,246
      Distinct items: 437,569
      SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE

      maybePruneItemUserMatrixPath: 16.50M
      weights: 13.80M
      pairwiseSimilarity: 18.81G
      Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 32 sec
      Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours

      I think the reasons may be the following:
      1) We use SequenceFileOutputFormat, which means the job can only run n mappers or reducers concurrently (n = number of Hadoop nodes).
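
One way to check hypothesis 1 is to estimate how many map tasks each input can produce at all. The sketch below is a simplification, not Hadoop's actual split logic: it assumes one map task per split and a split size equal to a 64 MB block (an assumed value; real clusters may differ, and Hadoop also applies minimum and maximum split settings). The class and method names are hypothetical. Under these assumptions, inputs of 16.50M or 13.80M fit in a single split, which would limit map-side parallelism regardless of the output format.

```java
public class SplitEstimate {

    // Simplified estimate: one split (and so one map task) per
    // splitBytes-sized chunk of input, rounding up.
    static long estimateSplits(long inputBytes, long splitBytes) {
        if (inputBytes <= 0) {
            return 0;
        }
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitBytes = 64 * mb; // assumed split size, not taken from the report

        long pruned = (long) (16.50 * mb);          // maybePruneItemUserMatrixPath
        long weights = (long) (13.80 * mb);         // weights
        long pairwise = (long) (18.81 * 1024 * mb); // pairwiseSimilarity

        System.out.println("pruned matrix splits: " + estimateSplits(pruned, splitBytes));
        System.out.println("weights splits: " + estimateSplits(weights, splitBytes));
        System.out.println("pairwise similarity splits: " + estimateSplits(pairwise, splitBytes));
    }
}
```

If the estimate for a job's input is 1, adding nodes will not add mappers for that job; the number of reducers is a separate, configurable setting.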
      2) We store redundant information.

      For example, the output of CooccurrencesMapper is:

      (ItemIndexA,similarity),(ItemIndexA,ItemIndexB,similarity)

      3) Some frequently called code (see MAHOUT-467):
      https://issues.apache.org/jira/browse/MAHOUT-467

      4) Many local variables are allocated inside loops (needs confirmation).

      For example, in class DistributedUncenteredZeroAssumingCosineVectorSimilarity:

      @Override
      public double weight(Vector v) {
        double length = 0.0;
        Iterator<Element> elemIterator = v.iterateNonZero();
        while (elemIterator.hasNext()) {
          double value = elemIterator.next().get(); // this one
          length += value * value;
        }
        return Math.sqrt(length);
      }
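
Regarding point 4, a primitive `double` declared inside a loop is a stack slot or register in Java, not a heap object, so declaring it per iteration should not cause allocation on its own. The sketch below reproduces the same norm computation on a plain `double[]` (an assumption made to keep it self-contained; it does not use Mahout's `Vector` or `Element` types, and the class name is hypothetical) and shows that moving the declaration outside the loop does not change the result.

```java
public class NormSketch {

    // Euclidean norm with the temporary declared inside the loop,
    // mirroring the structure of the weight() method quoted above.
    static double weightLocalInLoop(double[] nonZeroValues) {
        double length = 0.0;
        for (int i = 0; i < nonZeroValues.length; i++) {
            double value = nonZeroValues[i]; // primitive local: no heap allocation
            length += value * value;
        }
        return Math.sqrt(length);
    }

    // Same computation with the temporary hoisted out of the loop.
    static double weightLocalHoisted(double[] nonZeroValues) {
        double length = 0.0;
        double value;
        for (int i = 0; i < nonZeroValues.length; i++) {
            value = nonZeroValues[i];
            length += value * value;
        }
        return Math.sqrt(length);
    }

    public static void main(String[] args) {
        double[] values = {3.0, 4.0};
        System.out.println(weightLocalInLoop(values));  // prints 5.0
        System.out.println(weightLocalHoisted(values)); // prints 5.0
    }
}
```

In the original method, the objects that could be allocated per call are the iterator and, depending on the vector implementation, the elements it returns; a profiler would be needed to confirm whether either matters.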

      5) We may need to limit the number of cooccurrences.
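
To illustrate point 5, the sketch below counts how many cooccurrence pairs would be generated with and without a cap on the number of entries per column. It assumes that a column with n entries yields n(n+1)/2 unordered pairs, including each entry paired with itself; the actual emission scheme in RowSimilarityJob may differ. The class name, method names, the cap value, and the sample counts are all hypothetical. The point is only that the pair count grows quadratically, so a few very dense columns can dominate the output.

```java
public class CooccurrenceCap {

    // Unordered pairs (including self-pairs) among n entries of one column.
    // This pairing scheme is an assumption made for illustration.
    static long pairsForColumn(long n) {
        return n * (n + 1) / 2;
    }

    // Total pairs over all columns when each column is truncated
    // (for example by sampling) to at most maxPerColumn entries.
    static long totalPairs(int[] entriesPerColumn, int maxPerColumn) {
        long total = 0;
        for (int n : entriesPerColumn) {
            total += pairsForColumn(Math.min(n, maxPerColumn));
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up column sizes: two sparse columns and one dense column.
        int[] entriesPerColumn = {5, 10, 2000};

        long uncapped = totalPairs(entriesPerColumn, Integer.MAX_VALUE);
        long capped = totalPairs(entriesPerColumn, 500);

        System.out.println("uncapped pairs: " + uncapped);
        System.out.println("capped at 500:  " + capped);
    }
}
```

In this made-up example the single dense column accounts for almost all pairs, and capping it reduces the total by more than an order of magnitude while leaving the sparse columns untouched.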


          People

            Assignee: Unassigned
            Reporter: Han Hui Wen (huiwenhan)
            Votes: 0
            Watchers: 1
