Mahout
  1. Mahout
  2. MAHOUT-910

Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Labels:
      None

      Description

      Per the lengthy discussion on the mailing list about optimizing SamplingCandidateItemStrategy and related code, I'm opening this placeholder issue.

      1. SamplingCandidateItemsStrategy.java
        5 kB
        Sean Owen
      2. MAHOUT-910.patch
        8 kB
        Sean Owen
      3. MAHOUT-910.patch
        7 kB
        Sean Owen
      4. MAHOUT-910.patch
        7 kB
        Sean Owen

        Activity

        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1217 (See https://builds.apache.org/job/Mahout-Quality/1217/)
        MAHOUT-910 prelude: commit some clear wins in optimizing calls to intersectionSize()

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209577
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/model/GenericBooleanPrefDataModel.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.java
        • /mahout/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1217 (See https://builds.apache.org/job/Mahout-Quality/1217/ ) MAHOUT-910 prelude: commit some clear wins in optimizing calls to intersectionSize() srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209577 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/model/GenericBooleanPrefDataModel.java /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.java /mahout/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
        Hide
        Lance Norskog added a comment -
        return userIDs1.size() < userIDs2.size() ?
           userIDs2.intersectionSize(userIDs1) :
           userIDs1.intersectionSize(userIDs2);
        

        Could this optimization be pushed into FastIDSet?

        Show
        Lance Norskog added a comment - return userIDs1.size() < userIDs2.size() ? userIDs2.intersectionSize(userIDs1) : userIDs1.intersectionSize(userIDs2); Could this optimization be pushed into FastIDSet?
        Hide
        Sean Owen added a comment -

        Yeah you could; in a few cases the caller already knows the size or it can be included in another if statement, so I left it out.

        Show
        Sean Owen added a comment - Yeah you could; in a few cases the caller already knows the size or it can be included in another if statement, so I left it out.
        Hide
        Sean Owen added a comment -

        This is what I'm proposing to increase sample-ability. Now sampling applies also to items per user, not just users.

        Show
        Sean Owen added a comment - This is what I'm proposing to increase sample-ability. Now sampling applies also to items per user, not just users.
        Hide
        Sean Owen added a comment -

        Now, samples all three things: user's item, those items' users, and those users' items.

        Show
        Sean Owen added a comment - Now, samples all three things: user's item, those items' users, and those users' items.
        Hide
        Sean Owen added a comment -

        Daniel says:

        Hi Sean,
        I have been playing around with your patch. It looks good.
        From the little testing I did, I can also say that the recommendations seem
        to be more accurate than in my initial proposal (#4).

        I just have one suggestion though. I think the current parameters (int
        defaultMaxPrefsPerItemConsidered, int userItemCountMultiplier) are not so
        clear and don't give enough control over the sampling.
        I would introduce two other parameters (it won't be backwards-compatible
        though) -

        • maxSourcePrefsConsidered: which will be used
          in conjunction with SamplingLongPrimitiveIterator to do #1.
        • maxFinalPrefs : which will set the value for 'int max' in your patch
          (i.e. get rid of max = (int) Math.max(defaultMaxPrefsPerItemConsidered,
          userItemCountMultiplier * Math.log(Math.max(dataModel.getNumUsers(),
          dataModel.getNumItems()))); )

        In the future it would be possible to add a strategy that will affect the
        way maxSourcePrefsConsidered is sampled. For example, most recent items or
        least recent items or random sampling (like we have now). Even though that
        might not be the place to do so.. (since it's not in the context of the
        user)

        What do you think?

        Show
        Sean Owen added a comment - Daniel says: Hi Sean, I have been playing around with your patch. It looks good. From the little testing I did, I can also say that the recommendations seem to be more accurate than in my initial proposal (#4). I just have one suggestion though. I think the current parameters (int defaultMaxPrefsPerItemConsidered, int userItemCountMultiplier) are not so clear and don't give enough control over the sampling. I would introduce two other parameters (it won't be backwards-compatible though) - maxSourcePrefsConsidered: which will be used in conjunction with SamplingLongPrimitiveIterator to do #1. maxFinalPrefs : which will set the value for 'int max' in your patch (i.e. get rid of max = (int) Math.max(defaultMaxPrefsPerItemConsidered, userItemCountMultiplier * Math.log(Math.max(dataModel.getNumUsers(), dataModel.getNumItems()))); ) In the future it would be possible to add a strategy that will affect the way maxSourcePrefsConsidered is sampled. For example, most recent items or least recent items or random sampling (like we have now). Even though that might not be the place to do so.. (since it's not in the context of the user) What do you think?
        Hide
        Sean Owen added a comment -

        I agree. Since we have three samplings here, the simplest thing is to expose three settings to control each of them individually.
        Right now one setting is exposed, as two numbers. I could just clone (and rename) that pair of parameters so that they can be tuned individually. I could keep the 2-arg constructor and have it set all 3 pairs of parameters to the single pair given, to preserve a bit of backward-compatibility. (Behavior is still quite different though.)

        I think we can endlessly add hook after hook and strategy inside strategies; at some point I'd probably say it's easier for you to maintain your own copy with particular behavior changes if you really need to. At least it'd be better to see it's opening up support for a broad class of use cases.

        Show
        Sean Owen added a comment - I agree. Since we have three samplings here, the simplest thing is to expose three settings to control each of them individually. Right now one setting is exposed, as two numbers. I could just clone (and rename) that pair of parameters so that they can be tuned individually. I could keep the 2-arg constructor and have it set all 3 pairs of parameters to the single pair given, to preserve a bit of backward-compatibility. (Behavior is still quite different though.) I think we can endlessly add hook after hook and strategy inside strategies; at some point I'd probably say it's easier for you to maintain your own copy with particular behavior changes if you really need to. At least it'd be better to see it's opening up support for a broad class of use cases.
        Hide
        Sean Owen added a comment -

        See comments on next file.

        Show
        Sean Owen added a comment - See comments on next file.
        Hide
        Sean Owen added a comment -

        It changed so much it might be easier to read the new source file. Here I've gone another way: three parameters to control, but they're all just the "factor" from Sebstian's previous code. It lets you specify limits of the form f*log, so limits are logarithmic in the number of items/users. How's this?

        There's no overall cap since it is determined by these three values.

        Show
        Sean Owen added a comment - It changed so much it might be easier to read the new source file. Here I've gone another way: three parameters to control, but they're all just the "factor" from Sebstian's previous code. It lets you specify limits of the form f*log , so limits are logarithmic in the number of items/users. How's this? There's no overall cap since it is determined by these three values.
        Hide
        Ted Dunning added a comment -

        Sounds like this will down-sample at least some items from users with few items.

        I would recommend not eliminating any items from users with a small number of items and sampling to a hard limit above that.

        Show
        Ted Dunning added a comment - Sounds like this will down-sample at least some items from users with few items. I would recommend not eliminating any items from users with a small number of items and sampling to a hard limit above that.
        Hide
        Sean Owen added a comment -

        It's still computing some maximum (for each of three different things) and allowing everything if the number of things is less than the max, and only sampling if it exceeds the max. I think it's the same idea as before in this regard. Or are you questioning the default 'factor'? I picked 5. Right now if you have about 10,000 items, it will sample when a user exceeds 5*ln(10000) ~= 46 items.

        Show
        Sean Owen added a comment - It's still computing some maximum (for each of three different things) and allowing everything if the number of things is less than the max, and only sampling if it exceeds the max. I think it's the same idea as before in this regard. Or are you questioning the default 'factor'? I picked 5. Right now if you have about 10,000 items, it will sample when a user exceeds 5*ln(10000) ~= 46 items.
        Hide
        Sean Owen added a comment -

        If I've understood Ted right then I'm not hearing objection to taking this approach. We can modify it more later. Anyone mind if I get this in?

        Show
        Sean Owen added a comment - If I've understood Ted right then I'm not hearing objection to taking this approach. We can modify it more later. Anyone mind if I get this in?
        Hide
        Daniel Zohar added a comment - - edited

        I think the latest implementation looks great. It gives very fine control over the sampling.
        However I do have to agree with Ted. In my code, I had to create a hack for users with few items. Otherwise they would get very little to no recommendations.

        Show
        Daniel Zohar added a comment - - edited I think the latest implementation looks great. It gives very fine control over the sampling. However I do have to agree with Ted. In my code, I had to create a hack for users with few items. Otherwise they would get very little to no recommendations.
        Hide
        Sean Owen added a comment -

        Isn't this just a matter of setting the limits as you like? To be clear, any set smaller than a given limit is not sampled.

        Show
        Sean Owen added a comment - Isn't this just a matter of setting the limits as you like? To be clear, any set smaller than a given limit is not sampled.
        Hide
        Daniel Zohar added a comment -

        Ok, I see what you mean. Then I find the solution very suitable (at least for the problem I had..
        Do you plan to commit it anytime soon?

        Show
        Daniel Zohar added a comment - Ok, I see what you mean. Then I find the solution very suitable (at least for the problem I had.. Do you plan to commit it anytime soon?
        Hide
        Sebastian Schelter added a comment -

        Is it possible to get the same behavior as in MAHOUT-914 with the latest version?

        Show
        Sebastian Schelter added a comment - Is it possible to get the same behavior as in MAHOUT-914 with the latest version?
        Hide
        Sean Owen added a comment -

        Yes, you would just set a very high value for the first and third limits, and set your desired limit for the second one. I think it's a superset of the original implementation and yours.

        It occurs to me we need a reliable way to specify "no limit". And you're using log base 2 instead of e, which maybe makes more sense. Why don't I bake those two ideas in – then I think it's the same thing?

        Show
        Sean Owen added a comment - Yes, you would just set a very high value for the first and third limits, and set your desired limit for the second one. I think it's a superset of the original implementation and yours. It occurs to me we need a reliable way to specify "no limit". And you're using log base 2 instead of e, which maybe makes more sense. Why don't I bake those two ideas in – then I think it's the same thing?
        Hide
        Sebastian Schelter added a comment -

        That would be great. I'd also suggest that only sampling the items per user should be the default behaviour.

        Show
        Sebastian Schelter added a comment - That would be great. I'd also suggest that only sampling the items per user should be the default behaviour.
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1233 (See https://builds.apache.org/job/Mahout-Quality/1233/)
        MAHOUT-910 revamp sampling candidate strategy to expose sampling of items, items' users, and those users' items

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211377
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategyTest.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1233 (See https://builds.apache.org/job/Mahout-Quality/1233/ ) MAHOUT-910 revamp sampling candidate strategy to expose sampling of items, items' users, and those users' items srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211377 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategyTest.java
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1234 (See https://builds.apache.org/job/Mahout-Quality/1234/)
        MAHOUT-910 merge ideas from MAHOUT-914, better docs, new no-limit arg, different defaults from Sebastian

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211439
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1234 (See https://builds.apache.org/job/Mahout-Quality/1234/ ) MAHOUT-910 merge ideas from MAHOUT-914 , better docs, new no-limit arg, different defaults from Sebastian srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211439 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java

          People

          • Assignee:
            Sean Owen
            Reporter:
            Sean Owen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Due:
              Created:
              Updated:
              Resolved:

              Development