Mahout / MAHOUT-1007

Performance improvement in recommenditembased by splitting long records


    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 0.6
    • Fix Version/s: None
    • Labels:


      While running recommendations on the ASFEMail dataset with the example script provided with Mahout, we noticed that one map task in the unsymmetrify mapper job takes much longer to execute than the others. Profiling suggests the cause is the number of elements in each record. The attached patch addresses this by splitting long records into smaller ones, so that the data is distributed evenly among the unsymmetrify map tasks.

      The patch introduces a new command-line option, maxSimilarityReducerVectorSize, for RecommenderJob. Tested with maxSimilarityReducerVectorSize=5000, it preserves the same functionality, speeds up the unsymmetrify mapper job several times over on x86 architectures, and increases CPU utilization. By default, records are not split; setting maxSimilarityReducerVectorSize to a value greater than 0 enables splitting, which improved performance in this test.
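      The splitting idea described above can be illustrated with a minimal Java sketch. This is not the code from Patch_1007.patch: the class RecordSplitter and its split method are hypothetical names chosen for illustration, and a plain List stands in for whatever vector type the job actually uses. The only behavior taken from the description is that a maximum size of 0 or less leaves the record unsplit, while a positive value caps the number of elements per piece.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Mahout or of the attached patch.
final class RecordSplitter {

  private RecordSplitter() {}

  /**
   * Splits a record into consecutive chunks of at most maxSize elements.
   * A maxSize of 0 or less disables splitting, mirroring the default
   * described for maxSimilarityReducerVectorSize.
   */
  static <T> List<List<T>> split(List<T> record, int maxSize) {
    // Splitting disabled, or the record is already small enough: keep it whole.
    if (maxSize <= 0 || record.size() <= maxSize) {
      return Collections.singletonList(record);
    }
    List<List<T>> chunks = new ArrayList<>();
    for (int start = 0; start < record.size(); start += maxSize) {
      int end = Math.min(start + maxSize, record.size());
      // Copy the view so each chunk is independent of the original list.
      chunks.add(new ArrayList<>(record.subList(start, end)));
    }
    return chunks;
  }
}
```

      In a job like the one described, each chunk would presumably be written out as its own record under the same key, so that no single map task receives one oversized record. How the real patch emits and recombines the pieces is not stated here.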

      1. Patch_1007.patch
        8 kB
        Bhaskar Devireddy


        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Sean Owen made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 0.7 [ 12319261 ]
        Resolution Not A Problem [ 8 ]
        Bhaskar Devireddy made changes -
        Attachment Patch_1007.patch [ 12525670 ]
        Bhaskar Devireddy made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Bhaskar Devireddy made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Bhaskar Devireddy created issue -


          • Assignee:
            Sean Owen
          • Reporter:
            Bhaskar Devireddy
          • Votes:
            0
          • Watchers:
            1


            • Created: