Mahout / MAHOUT-824

FastByIDRunningAverage: Optimize SlopeOneRecommender by optimizing MemoryDiffStorage

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels:
      None

      Description

      The SlopeOneRecommender has by far the best RMS of all the online recommenders in Mahout (that I've found). Unfortunately, the implementation also uses much more memory and is unusable on my laptop.

      This patch optimizes memory (and speed) by folding FastByIDMap<RunningAverage> into one class: FastByIDRunningAverage. This is what it sounds like: a long-addressable array of running averages (and, optionally, standard deviations).
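The idea can be sketched as follows. This is an illustrative reconstruction, not the patch's actual code: the class name echoes the issue, but the open-addressing scheme and the method names (slotFor, addDatum, getAverage) are assumptions. The point is that parallel primitive arrays replace a FastByIDMap full of boxed RunningAverage objects.

```java
// Sketch: one open-addressed table of primitive arrays holding a running
// average per long ID, instead of one RunningAverage object per entry.
public class FastByIDRunningAverageSketch {
  private static final long NULL_ID = Long.MIN_VALUE; // sentinel for empty slots
  private final long[] ids;
  private final int[] counts;
  private final double[] averages;

  public FastByIDRunningAverageSketch(int capacity) {
    ids = new long[capacity];
    counts = new int[capacity];
    averages = new double[capacity];
    java.util.Arrays.fill(ids, NULL_ID);
  }

  private int slotFor(long id) {
    int slot = (int) ((id ^ (id >>> 32)) & 0x7fffffff) % ids.length;
    while (ids[slot] != NULL_ID && ids[slot] != id) {
      slot = (slot + 1) % ids.length; // linear probing
    }
    return slot;
  }

  /** Fold one datum into the running average for this ID. */
  public void addDatum(long id, double datum) {
    int slot = slotFor(id);
    if (ids[slot] == NULL_ID) {
      ids[slot] = id;
    }
    int n = ++counts[slot];
    averages[slot] += (datum - averages[slot]) / n; // incremental mean update
  }

  public double getAverage(long id) {
    int slot = slotFor(id);
    return ids[slot] == id ? averages[slot] : Double.NaN;
  }

  public int getCount(long id) {
    int slot = slotFor(id);
    return ids[slot] == id ? counts[slot] : 0;
  }
}
```

The memory win comes from avoiding per-entry object headers and boxing; the sketch omits growth/rehashing, which a real implementation would need.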

      1. MAHOUT-824.patch
        78 kB
        Lance Norskog
      2. MAHOUT-824.short.patch
        36 kB
        Lance Norskog

        Activity

        Ted Dunning added a comment -

        For any method that doesn't have good regularization, trimming helps avoid over-training. Slope-one and all of the correlation methods have zero regularization and are seriously susceptible to coincidence. LLR trimming is kind of the simplest level of regularization. Methods like latent factor log-linear have serious and real regularization and probably don't need trimming.

        Sean Owen added a comment -

        I don't have a paper; I am suggesting however that such pruning can even improve the result. A diff based on 1 user is different than one based on 100 users' data. The pruning indeed appears dumb. Ideally it would at least throw out a "poor" diff in favor of a new one that's better. But, under this scheme, it's low-count diffs that are worst. And so a new 1-count diff is by definition among the worst. So it's as optimal a strategy as anything. Put another way, if you haven't seen an A-B diff at all in millions and millions of users, it's unlikely to be of consequence.

        Lance Norskog added a comment - edited

        Sure, this is fine.

        MemoryDiffStorage exposes the fact that it uses the RunningAverage(AndStdDev) classes. This was a vexing implementation leakage, and is why I had to add new constructors for FullRunningAverage(AndStdDev). If I went after it more, I would correct this.

        Anyway- is there a paper somewhere that defines the error loss for dropping samples in processOneUser? Is there a better subsampler than 'stop after the first N entries'?
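One standard answer to the subsampling question is reservoir sampling, which keeps a uniform random sample of N entries from a stream instead of just the first N. A generic sketch (not code from the patch; the class and method names are made up for illustration):

```java
import java.util.Random;

// Algorithm R: after seeing i items, every item has probability n/i of
// being in the reservoir, so the sample is unbiased regardless of order.
public class ReservoirSamplerSketch {
  public static long[] sample(long[] stream, int n, Random random) {
    long[] reservoir = new long[Math.min(n, stream.length)];
    for (int i = 0; i < stream.length; i++) {
      if (i < reservoir.length) {
        reservoir[i] = stream[i]; // fill the reservoir first
      } else {
        // Keep item i with probability n / (i + 1)
        int j = random.nextInt(i + 1);
        if (j < reservoir.length) {
          reservoir[j] = stream[i];
        }
      }
    }
    return reservoir;
  }
}
```

Unlike "stop after the first N entries", this does not systematically favor whichever preferences happen to come first in iteration order.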

        Hudson added a comment -

        Integrated in Mahout-Quality #1077 (See https://builds.apache.org/job/Mahout-Quality/1077/)
        Part of MAHOUT-824 – some additional tests

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1178133
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/common/FullRunningAverage.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/common/FullRunningAverageAndStdDev.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/common/FastByIDMapTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/common/RunningAverageAndStdDevTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/common/RunningAverageTest.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/recommender/slopeone/MemoryDiffStorageTest.java
        Sean Owen added a comment -

        Marking "Won't Fix", but really, part of this was committed. The rest would have to be fixed up to pass tests first, and we'd probably want more evidence it's justified over the existing pruning methods.

        Sean Owen added a comment -

        I agree that slope-one is surprisingly effective, though storing all item-item diffs is expensive. The good news is that almost all item-item diffs carry virtually no info; they're noise. So, I think there's a much better way to dodge the memory constraint: prune the diffs. That can get you much farther than even optimizing away object overhead.

        See the constructors that let you cap the number of diffs stored. It can be reasonably smart; it throws out data based on standard deviation of the diffs.
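The capacity-cap idea can be made concrete with a minimal sketch. Mahout's actual MemoryDiffStorage is smarter (as noted above, it weighs standard deviation when discarding); this illustration only shows the cap combined with the observation from the thread that a brand-new diff has count 1 and is therefore among the least-supported entries. All names here are assumptions, not the real API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: cap the number of stored item-item diffs; once at capacity,
// refuse brand-new diffs (count 1, the weakest by definition) while still
// updating diffs that already have support.
public class CappedDiffStorageSketch {
  private static final class Diff {
    int count;
    double average;
  }

  private final int maxDiffs;
  private final Map<Long, Diff> diffs = new HashMap<>();

  public CappedDiffStorageSketch(int maxDiffs) {
    this.maxDiffs = maxDiffs;
  }

  /** Record one observed rating difference between item A and item B. */
  public void addDiff(long itemA, long itemB, double diff) {
    long key = (itemA << 32) | (itemB & 0xffffffffL); // assumes 32-bit item IDs
    Diff d = diffs.get(key);
    if (d == null) {
      if (diffs.size() >= maxDiffs) {
        return; // at capacity: drop the new 1-count diff
      }
      d = new Diff();
      diffs.put(key, d);
    }
    d.count++;
    d.average += (diff - d.average) / d.count; // incremental mean
  }

  public int size() {
    return diffs.size();
  }
}
```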

        So, I'd prefer not to add such a large amount of copy-and-paste yet. (And, see below: as you say, I don't think this quite works either, or at least there is some evidence it changes behavior.) I think it's fine work, though, and worth having in a patch; if it's really popular, well, there's nothing terribly wrong with it, just duplicative.

        For the rest of the code –

        I am happy to add more constructors for the running averages in general, sure, plus tests. I think some of the tests need a fix – for example the copy constructor test doesn't test a copy.

        I can add the FastByIDMapTest. The copied keySet/values tests don't really test those methods, but I can remove them.

        I am not sure why the slope-one tests would need to be changed – this suggests there's something about the changes that might have introduced a bug. I might have to peel some of that back if it looks like the case.

        Likewise, I am not sure why the IR stats eval would have changed; this ought not affect anything there.

        I would in general not want changes that make fields public or protected.

        Lance Norskog added a comment -

        The new MemoryDiffStorage2 is revamped to use FastByIDRunningAverage. The problem is that SlopeOneRecommender gives different RMS evaluation numbers on real data. The unit tests match; I've added a bunch more unit tests trying to track down the problem. Only with real data is there a discrepancy.

        I'm giving up for the nonce. SlopeOne is so much better that I still think it is worth optimizing, but I've run out of steam.

        FastByIDRunningAverage is tight and could be used in other places, so I've packaged it as a separate smaller patch.

        Lance Norskog added a comment -

        MAHOUT-824.patch includes the whole project.
        MAHOUT-824.short.patch includes only the additions to the utilities: FastByIDRunningAverage, some fixes to FullRunningAverageAndStdDev, and unit-test enhancements.


          People

          • Assignee:
            Sean Owen
            Reporter:
            Lance Norskog
          • Votes:
            0
            Watchers:
            0
