Mahout
MAHOUT-925

Evaluate the reach of recommender algorithms

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Labels:
      None

      Description

      The evaluation of a CF algorithm should include reach, the proportion of users for whom a recommendation could be made. An algorithm usually has a cutoff value on the confidence of the recommender, and if the confidence is not high enough, no recommendation is made. Either this cutoff or the number of requested recommendations could be varied as part of the evaluation. The proposed patch adds this statistic.
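
      A minimal sketch of the statistic being proposed, assuming the standard Taste Recommender interface; the helper name and loop are illustrative, not the actual patch:

      import java.util.List;
      import org.apache.mahout.cf.taste.common.TasteException;
      import org.apache.mahout.cf.taste.recommender.RecommendedItem;
      import org.apache.mahout.cf.taste.recommender.Recommender;

      // Illustrative only: reach = fraction of evaluated users who received at least one
      // recommendation when 'at' items were requested.
      static double estimateReach(Recommender recommender, Iterable<Long> testUserIDs, int at)
          throws TasteException {
        int usersEvaluated = 0;
        int usersWithRecommendations = 0;
        for (long userID : testUserIDs) {
          List<RecommendedItem> recs = recommender.recommend(userID, at);
          usersEvaluated++;
          if (!recs.isEmpty()) {
            usersWithRecommendations++;
          }
        }
        return usersEvaluated == 0 ? 0.0 : (double) usersWithRecommendations / usersEvaluated;
      }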

      My build with this patch breaks testMapper(org.apache.mahout.classifier.df.mapreduce.partial.Step1MapperTest): org.apache.mahout.classifier.df.node.Leaf.<init>(I)V. The test seems unrelated to the patch, so I assume it is also broken at the head of trunk. Unfortunately I am under a deadline and do not have time to write tests for the patch.

      1. MAHOUT-925.patch
        3 kB
        Anatoliy Kats
      2. MAHOUT-925.patch
        4 kB
        Anatoliy Kats

        Activity

        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hudson added a comment -

        Integrated in Mahout-Quality #1252 (See https://builds.apache.org/job/Mahout-Quality/1252/)
        MAHOUT-925 Add basic idea of 'reach'

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1213930
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/eval/IRStatistics.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/eval/GenericRecommenderIRStatsEvaluator.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/eval/IRStatisticsImpl.java
        Sean Owen made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.6 [ 12316364 ]
        Resolution Fixed [ 1 ]
        Sean Owen added a comment -

        Committed a variation on the patch which closes the immediate issue; we could talk about it further though.

        Sean Owen added a comment -

        @Anatoliy how would the recommender decide a relevance threshold? That is a priori knowledge, not something the recommender can know. This seems orthogonal to the other issues here. If you're OK with the computation as described, I'll commit and of course can iterate on it later.

        @Ted that's probably true, though all the same I'm happy to add 3 lines of code and 1 more field to the result to compute 'reach' as Anatoliy describes here. It's very little additional code or complexity.

        Ted Dunning added a comment -

        Reach is a nice statistic to have, but I think it can be had more simply than this.

        In my experience, the quality of recommendations depends very strongly on the number of items in the history. Where the history is too small, recommendations will typically be pretty poor; above a threshold, they will be as good as they are going to be. For music, that threshold was 5-10 items; for video it was comparable.

        If this is true, then the reach computation can be broken into two parts:

        a) what is the threshold?

        b) how many people reach the threshold?

        The first question is answerable by the standard precision/recall measurement methods, except that the resulting data need to be averaged with an awareness of the history size so that the threshold can be detected.

        The second question is simple arithmetic and doesn't need a framework.
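
        A hedged sketch of that arithmetic, using the Taste DataModel API; the threshold value and method name are assumptions, not anything in the patch:

        import org.apache.mahout.cf.taste.common.TasteException;
        import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
        import org.apache.mahout.cf.taste.model.DataModel;

        // Illustrative only: reach approximated as the fraction of users whose history size
        // meets a quality threshold (e.g. 5-10 items for music, per the comment above).
        static double reachByHistorySize(DataModel model, int historyThreshold) throws TasteException {
          int totalUsers = 0;
          int usersAboveThreshold = 0;
          LongPrimitiveIterator userIDs = model.getUserIDs();
          while (userIDs.hasNext()) {
            long userID = userIDs.nextLong();
            totalUsers++;
            if (model.getPreferencesFromUser(userID).length() >= historyThreshold) {
              usersAboveThreshold++;
            }
          }
          return totalUsers == 0 ? 0.0 : (double) usersAboveThreshold / totalUsers;
        }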

        Anatoliy Kats made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Anatoliy Kats made changes -
        Attachment MAHOUT-925.patch [ 12507177 ]
        Anatoliy Kats added a comment -

        We agree that there needs to be enough training data for a recommender to output something, but you believe the cutoff should happen in the evaluator, whereas I think the recommender should figure this out by itself, via some sort of threshold on the expected rating. For now, the distinction is mostly theoretical, so let's use what we already have. I'll change the reach calculation as you suggest.
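
        A hedged sketch of the recommender-side cutoff described here; the threshold parameter and helper name are hypothetical:

        import java.util.ArrayList;
        import java.util.List;
        import org.apache.mahout.cf.taste.common.TasteException;
        import org.apache.mahout.cf.taste.recommender.RecommendedItem;
        import org.apache.mahout.cf.taste.recommender.Recommender;

        // Illustrative only: keep recommendations whose estimated preference clears a threshold,
        // so the recommender itself decides whether it has anything confident enough to return.
        static List<RecommendedItem> recommendAboveThreshold(Recommender recommender, long userID,
                                                             int howMany, float minEstimate)
            throws TasteException {
          List<RecommendedItem> kept = new ArrayList<RecommendedItem>();
          for (RecommendedItem item : recommender.recommend(userID, howMany)) {
            if (item.getValue() >= minEstimate) {
              kept.add(item);
            }
          }
          return kept;
        }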

        Sean Owen added a comment -

        Yes you could create a different kind of test that doesn't hold out any data to find this reach figure. I don't think it's worth a whole different test class just for this. The entire test framework is only valid insofar as you run it on enough data, with enough to train, that the result reflects how the full system works. So I think it's as valid as anything else to run on the training data only.

        Regarding the "2@" prefs heuristic: it's not really a question of the recommender deciding not to recommend. It's that it will always recommend as much as possible, up to what you ask for. But if the test is based on so little data to begin with, the result is not very meaningful. If I am figuring precision@5 and the user has only 4 prefs, what can I do? I can't even call all 4 "relevant" items since it would leave no training data. Even if I did, there would be no way to achieve 100% precision as there are only 4 relevant items. I (arbitrarily) picked 2@ as the minimum – 10 here if @=5 – since you can select 5 of the 10 in this case as relevant, and have as many available for training.

        You would not want to drop a user's result just because it recommended 3 items in a test @5. That's a perfectly valid result (given the condition in the preceding paragraph) to include. You can still decide how many of those 3 are relevant, and how many of the relevant items are in those 3.

        Precision and recall are not the same in general. If the number of items deemed relevant is equal to "@", then precision will equal recall, yes. And that is usually true for data with ratings, the way this class works. It will just choose some "@" of the items, as there is no basis to call one more relevant than another. Choosing that many is also somewhat arbitrary; it can't be 0 and can't be all items (or there would be no training data from the user under test), so that looked like a nice round number.
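
        A hedged illustration of the "2@" guard described above; the helper name is hypothetical:

        import org.apache.mahout.cf.taste.common.TasteException;
        import org.apache.mahout.cf.taste.model.DataModel;

        // Illustrative only: a user is testable at precision@N only with at least 2*N preferences,
        // so N items can be held out as "relevant" while at least N remain for training.
        static boolean enoughPrefsToTest(DataModel model, long userID, int at) throws TasteException {
          return model.getPreferencesFromUser(userID).length() >= 2 * at;
        }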

        Anatoliy Kats added a comment -

        Actually, I see the difference between precision and recall still remains, but not in the boolean case.

        Anatoliy Kats added a comment -

        That's a good point; we should be careful about how we analyze undersampled data. The purpose of measuring reach is to predict what percentage of the audience in a production system will get a required number of recommendations. Actually, I think the easiest way to do this is to loop over the users and try to generate recommendations on a model that does not exclude any preferences.

        Also, in the spirit of creating conditions maximally similar to a production environment, it seems unfair to exclude users because the evaluator judges there are not enough preferences remaining (lines 116-118 in the patched code). The recommender should decide for itself whether or not to generate anything. Only if it refuses to generate the required number of recommendations do we exclude the user from the IR statistics. This kind of change would always make precision and recall equal. They often are in practice. What was the original motivation for including both statistics?
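
        A hedged sketch of the approach suggested in this comment; the method name and the "required" parameter are assumptions:

        import org.apache.mahout.cf.taste.common.TasteException;
        import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
        import org.apache.mahout.cf.taste.model.DataModel;
        import org.apache.mahout.cf.taste.recommender.Recommender;

        // Illustrative only: evaluate reach against the full model (no held-out preferences),
        // counting a user only if the recommender returns the required number of items.
        static double reachOnFullModel(Recommender recommender, DataModel model, int required)
            throws TasteException {
          int totalUsers = 0;
          int satisfiedUsers = 0;
          LongPrimitiveIterator userIDs = model.getUserIDs();
          while (userIDs.hasNext()) {
            long userID = userIDs.nextLong();
            totalUsers++;
            if (recommender.recommend(userID, required).size() >= required) {
              satisfiedUsers++;
            }
          }
          return totalUsers == 0 ? 0.0 : (double) satisfiedUsers / totalUsers;
        }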

        Sean Owen added a comment -

        This is fine, though don't you want to count like so?

        if (numRecommendedItems > 0) {
          reach++;
        }

        Otherwise it seems like you're just counting all users, except the ones that the test framework couldn't test due to sampling size, which is something else and something you want to ignore in general. (So actually I think I'd count both reach and the number of users that the test framework succeeded for). Is that about right for you?
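
        A hedged sketch of that bookkeeping, with hypothetical field names, in the same fragment style as the snippet above:

        // Illustrative only: count every user the framework could test, and separately those
        // who received at least one recommendation; reach is their ratio at the end.
        numUsersTested++;
        if (numRecommendedItems > 0) {
          numUsersWithRecommendations++;
        }
        // after the loop over users:
        double reach = (double) numUsersWithRecommendations / numUsersTested;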

        Anatoliy Kats made changes -
        Field Original Value New Value
        Attachment MAHOUT-925.patch [ 12507010 ]
        Anatoliy Kats created issue -

          People

          • Assignee:
            Sean Owen
          • Reporter:
            Anatoliy Kats
          • Votes:
            0
          • Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 1h
              Remaining: 1h
              Logged: Not Specified
