Mahout / MAHOUT-767

Improve RowSimilarityJob performance

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

    Description

      (See http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for background)

      Currently, RowSimilarityJob defers computing the similarity metric until the reduce phase, and emits many Cooccurrence objects before it gets there. For similarity metrics that are algebraic (see http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions), we should be able to do much of the computation in the Mapper of this phase and also take advantage of a Combiner.
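
      To illustrate what "algebraic" buys here, the following is a minimal sketch using cosine similarity as the example metric. It is plain Java with no Hadoop or Mahout dependencies, and every class and method name in it is hypothetical; it is not the code from the attached patches. The idea is that the dot product of two rows is a sum of per-column products, so each column can contribute a partial value on the map side, partial values can be summed in a combiner, and only the final division needs to wait for the reduce side.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from Mahout. */
final class AlgebraicCosineSketch {

    /**
     * "Map" step: for one column (rowId -> value), emit the partial dot
     * product a_k * b_k for every pair of rows that co-occur in that column.
     * The key "i,j" (with i < j) identifies the row pair.
     */
    static Map<String, Double> mapColumn(Map<Integer, Double> column) {
        Map<String, Double> partials = new HashMap<>();
        for (Map.Entry<Integer, Double> a : column.entrySet()) {
            for (Map.Entry<Integer, Double> b : column.entrySet()) {
                if (a.getKey() < b.getKey()) {
                    partials.put(a.getKey() + "," + b.getKey(),
                                 a.getValue() * b.getValue());
                }
            }
        }
        return partials;
    }

    /**
     * "Combine" step: addition is associative and commutative, so partial
     * dot products can be merged early without changing the result.
     */
    static Map<String, Double> combine(List<Map<String, Double>> partialMaps) {
        Map<String, Double> sums = new HashMap<>();
        for (Map<String, Double> partials : partialMaps) {
            for (Map.Entry<String, Double> e : partials.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return sums;
    }

    /**
     * "Reduce" step: finish the metric from the summed dot product and the
     * row norms, which are assumed to have been computed in an earlier pass.
     */
    static double finish(double dot, double normA, double normB) {
        return dot / (normA * normB);
    }
}
```

      With this shape, the data shuffled per row pair is one running sum rather than one object per co-occurrence, which is where the hoped-for saving would come from.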

      We should use a marker interface to indicate whether a similarity metric is algebraic and, if it is, use an appropriate Mapper implementation; otherwise we can fall back on our existing implementation.
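
      The marker-interface dispatch could look roughly like the sketch below. All names here (SimilarityMeasure, Algebraic, MapperKind, and the two example measures) are assumptions made for illustration and are not Mahout's actual types; the point is only that an empty interface lets the job pick a code path with an instanceof check.

```java
/** Hypothetical sketch of marker-interface dispatch; names are illustrative only. */
final class MarkerDispatchSketch {

    /** A similarity metric over two dense rows. */
    interface SimilarityMeasure {
        double similarity(double[] a, double[] b);
    }

    /** Marker interface: declares no methods, only tags a metric as algebraic. */
    interface Algebraic {
    }

    /** Which mapper implementation the job would select. */
    enum MapperKind { ALGEBRAIC, FALLBACK }

    /** Cosine similarity decomposes into sums, so it carries the marker. */
    static final class Cosine implements SimilarityMeasure, Algebraic {
        @Override
        public double similarity(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / Math.sqrt(normA * normB);
        }
    }

    /** Placeholder for a metric that makes no algebraic promise. */
    static final class OpaqueMeasure implements SimilarityMeasure {
        @Override
        public double similarity(double[] a, double[] b) {
            return 0.0; // stand-in value; the real computation is irrelevant here
        }
    }

    /** Select the mapper: the algebraic path if the marker is present, else the existing one. */
    static MapperKind chooseMapper(SimilarityMeasure measure) {
        return (measure instanceof Algebraic) ? MapperKind.ALGEBRAIC : MapperKind.FALLBACK;
    }
}
```

      A metric that does not implement the marker keeps working unchanged through the fallback path, so existing metrics would not need to be touched.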

      Attachments

        1. MAHOUT-767-2.patch (232 kB, Sebastian Schelter)
        2. MAHOUT-767.patch (60 kB, Sebastian Schelter)

          People

            Assignee: Sebastian Schelter (ssc)
            Reporter: Grant Ingersoll (gsingers)