Mahout / MAHOUT-905

CachingUserSimilarity and CachingItemSimilarity have wrong (far too small) default maxSizes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 0.5
    • Fix Version/s: None
    • Environment:

      Description

      I am currently tuning my recommender discussed here: http://thread.gmane.org/gmane.comp.apache.mahout.user/10433.

      As a first step I wrapped my LogLikelihoodSimilarity in a CachingUserSimilarity and used Java VisualVM to profile the calls. I noticed that I didn't get any performance benefit, so I had a look at the code.

      Line 47 of CachingUserSimilarity.java, this(similarity, dataModel.getNumItems());, is wrong. If we want to cache all item similarities, we need a cache with (dataModel.getNumItems()*(dataModel.getNumItems()-1))/2 possible entries, one for each unordered pair.

      I am now doing this in the constructor. I attached a patch to adjust this in the trunk build.
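
      The size proposed above is the number of unordered pairs among n elements. A minimal sketch of that arithmetic follows; the class and method names (PairCount, pairCount, clampToInt) are hypothetical and not part of Mahout or of the attached patch. It assumes the cache size is passed as an int, so the product is computed as a long and clamped.

```java
// Hypothetical helper, not Mahout code: sizes a cache for all unordered pairs.
final class PairCount {
    private PairCount() {}

    /** Number of unordered pairs among n distinct elements: n * (n - 1) / 2. */
    static long pairCount(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Widen to long before multiplying; the int product overflows
        // once n exceeds roughly 46,000.
        return (long) n * (n - 1) / 2;
    }

    /** Clamps a pair count to the int range (assumes an int-sized cache parameter). */
    static int clampToInt(long value) {
        return (int) Math.min(value, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(pairCount(1000));                 // 499500
        System.out.println(pairCount(100000));               // 4999950000
        System.out.println(clampToInt(pairCount(100000)));   // 2147483647
    }
}
```

      Even a moderately sized data model produces a pair count beyond what fits in an int, which gives a sense of how large a cache sized this way would be.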

        Activity

        Manuel Blechschmidt created issue -
        Manuel Blechschmidt added a comment -

        The attached patch fixes this issue.

        Manuel Blechschmidt made changes -
        Field Original Value New Value
        Attachment CachingSimilariyAdjustedDefaultSize.patch [ 12505671 ]
        Manuel Blechschmidt added a comment -

        Attached is a patch solving this issue.

        Manuel Blechschmidt made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Sean Owen added a comment -

        (This is hardly a bug!)

        The cache is supposed to be much smaller than the universe of all possible things you might cache, since only a small fraction will represent most of the pairs that are computed. If you cache everything I think you'll find your hit rate drops as lots of the elements are never read a second time. I would rather not create such a massive cache by default, no, though you can of course set it however you like for your use case.
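
        To illustrate the idea of a cache that is deliberately smaller than the set of all pairs, here is a minimal, self-contained sketch of a size-bounded, least-recently-used cache for pair similarities. It is not Mahout's implementation: BoundedPairCache, PairSimilarity, and every method name are hypothetical, and the eviction policy is an assumption chosen for simplicity. The maximum size is an explicit constructor argument, so it can be set to suit a given workload.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Mahout code: an LRU cache keyed by unordered int pairs.
final class BoundedPairCache {

    /** Hypothetical callback that computes the similarity of a pair on a miss. */
    interface PairSimilarity {
        double similarity(int a, int b);
    }

    private final int maxSize;
    private final Map<Long, Double> cache;
    private long hits;
    private long misses;

    BoundedPairCache(int maxSize) {
        this.maxSize = maxSize;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<Long, Double>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Double> eldest) {
                return size() > BoundedPairCache.this.maxSize;
            }
        };
    }

    /** Returns the cached similarity for the unordered pair, computing it on a miss. */
    double get(int a, int b, PairSimilarity sim) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        // Pack the ordered pair into one long so (a, b) and (b, a) share a key.
        long key = ((long) lo << 32) | (hi & 0xFFFFFFFFL);
        Double cached = cache.get(key);
        if (cached != null) {
            hits++;
            return cached;
        }
        misses++;
        double value = sim.similarity(lo, hi);
        cache.put(key, value);
        return value;
    }

    long hits() { return hits; }
    long misses() { return misses; }
    int size() { return cache.size(); }
}
```

        Comparing hits() and misses() under a realistic access pattern is one way to pick a maximum size empirically instead of defaulting to the full pair count.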

        Sean Owen made changes -
        Issue Type Bug [ 1 ] Improvement [ 4 ]
        Priority Major [ 3 ] Minor [ 4 ]
        Sean Owen made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Not A Problem [ 8 ]
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                    Time In Source Status   Execution Times   Last Executer         Last Execution Date
        Open -> Patch Available       1m 35s                  1                 Manuel Blechschmidt   30/Nov/11 22:13
        Patch Available -> Resolved   5m 16s                  1                 Sean Owen             30/Nov/11 22:19
        Resolved -> Closed            70d 15h 43m             1                 Sean Owen             09/Feb/12 14:02

          People

          • Assignee: Sean Owen
          • Reporter: Manuel Blechschmidt
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 0.5h (original estimate)
              Remaining: 0.5h
              Logged: Not Specified
