Mahout
MAHOUT-906

Allow collaborative filtering evaluators to use custom logic in splitting data set

      Description

      I want to start a discussion about factoring out the logic used to split the data set into training and testing sets. Here is how things stand: there are two independent evaluator classes. AbstractDifferenceRecommenderEvaluator splits all the preferences randomly into a training and a testing set. GenericRecommenderIRStatsEvaluator takes one user at a time, removes their top AT preferences, and counts how many of them the system recommends back.

      I have two use cases that both deal with temporal dynamics. In one case, there may be expired items that can be used for building a training model but not a test model. In the other, I may want to simulate the behavior of a real system by building a preference matrix on days 1 through k and testing on the ratings the user generated on day k+1. In this case it's not items but preferences (user, item, rating triplets) that may belong only to the training set. Before we discuss an appropriate design, are there any other use cases we need to keep in mind?
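The second use case can be pictured as a splitter keyed on a cutoff timestamp. A minimal sketch with made-up types (the `Pref` record here is a stand-in for illustration only, not the interface that later landed in trunk):

```java
import java.util.ArrayList;
import java.util.List;

public class TimeBasedSplitSketch {

  // Minimal stand-in for a timestamped (user, item, rating) preference.
  record Pref(long userID, long itemID, float value, long timestamp) {}

  // Preferences strictly before the cutoff train the model;
  // the rest are held out for testing.
  static List<List<Pref>> splitByTime(List<Pref> prefs, long cutoff) {
    List<Pref> training = new ArrayList<>();
    List<Pref> test = new ArrayList<>();
    for (Pref p : prefs) {
      (p.timestamp() < cutoff ? training : test).add(p);
    }
    return List.of(training, test);
  }

  public static void main(String[] args) {
    List<Pref> prefs = List.of(
        new Pref(1, 10, 4.0f, 1L),   // day 1
        new Pref(1, 11, 5.0f, 2L),   // day 2
        new Pref(2, 10, 3.0f, 3L));  // day 3
    List<List<Pref>> split = splitByTime(prefs, 3L);
    System.out.println(split.get(0).size() + " " + split.get(1).size()); // 2 1
  }
}
```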

      1. MAHOUT-906.patch
        14 kB
        Anatoliy Kats
      2. MAHOUT-906.patch
        8 kB
        Anatoliy Kats
      3. MAHOUT-906.patch
        8 kB
        Anatoliy Kats
      4. MAHOUT-906.patch
        14 kB
        Sean Owen
      5. MAHOUT-906.patch
        8 kB
        Anatoliy Kats

        Activity

        Hudson added a comment -

        Integrated in Mahout-Quality #1280 (See https://builds.apache.org/job/Mahout-Quality/1280/)
        MAHOUT-906 add hook for different relevant item ID logic

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225649
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/eval/RelevantItemsDataSplitter.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/eval/GenericRecommenderIRStatsEvaluator.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/eval/GenericRelevantItemsDataSplitter.java
        Sean Owen added a comment -

        OK, I do understand the different computation for "size". As is, it's equivalent to the original, so I used the patched version. For "numItems", I think the new formula will slightly overcount as it is used. It only affects fall-out anyway, so I left it as-is.

        Sean Owen added a comment -

        OK. I'm ready to commit the hook, with minor changes. The only question I have remaining about the change is the new definitions of 'numItems' and 'size'. I think these were as intended as they were. Am I missing the motivation for the change?

        Anatoliy Kats added a comment -

        You're right, old data should be excluded from the test corpus. The point of a time-based algorithm is to treat older data differently from newer data, so if there is data it chooses to disregard, that should be done at the recommender level, not the evaluator level. However, we still need a test start and end date: the entire idea is that we use only preferences prior to the "current" time to make a recommendation. In production, that's always the case. In a test environment we need to approximate it using a sliding window approach: train on days 1-20, test on day 21, then train on days 1-21 and test on day 22, etc.

        I don't mind using my local modifications of the time-based splitter, so long as the trunk maintains the hook. I see the motivation as well for putting the hook in the abstract difference recommender. If I get around to using a preference-value recommender, I'll look into it.
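The sliding-window protocol described above can be sketched as a window generator that enumerates (training range, test day) pairs. A hypothetical helper in plain Java, with no Mahout types:

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

  record Window(int trainStartDay, int trainEndDay, int testDay) {}

  // Enumerate windows: train on [start, start + trainDays - 1],
  // test on the day immediately after the training range.
  static List<Window> windows(int firstDay, int lastDay, int trainDays) {
    List<Window> result = new ArrayList<>();
    for (int start = firstDay; start + trainDays <= lastDay; start++) {
      result.add(new Window(start, start + trainDays - 1, start + trainDays));
    }
    return result;
  }

  public static void main(String[] args) {
    // Days 1..22 with a 20-day training window:
    // train 1-20 / test 21, then train 2-21 / test 22.
    List<Window> w = windows(1, 22, 20);
    System.out.println(w.size());           // 2
    System.out.println(w.get(0).testDay()); // 21
  }
}
```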

        Sean Owen added a comment -

        Old data can and should just be excluded from the test corpus, full stop. Or, to put it another way: that's not specific to a time-based split, and in general the need is already met earlier in the pipeline, so to speak. Why is an end date needed... seems best to just use all the recent data you have.

        Understand about also including score in addition to time in the definition of "relevant". I think the problem I'm having is that the "time+score" implementation seems of nearly the same value as the current implementation, which is just "score". The time bit seems secondary, and partially used as a simple filter on the input. So I'm struggling a bit to like just this change; the new implementation is mostly a copy. It's on the borderline of being something you may just want to use locally for your own purpose.

        Any other thoughts from anyone else here?

        I suppose the time-based split makes a lot more sense to me for the estimation-based test and can clearly see the use in a hook and second implementation there. No question about that.

        Anatoliy Kats added a comment -

        You're right, the copy-paste is a bad sign, but I don't quite know how to fix it. I do want a constructor with at least three distinct times. In a real system, preferences older than a certain age might be deleted as irrelevant. A simple way to emulate that is to test using a sliding window: days 1-30 training, day 31 testing, then days 2-31 training, day 32 testing, etc. So I'd need a start date, a split date, and an end date. The relevance threshold is here for the same reason as in the generic splitter: we don't want to test on negatively rated items. I think storing it in the splitter classes is a good idea. Perhaps we could create an abstract class that leafs through a user's preferences and returns a sorted list of those above the threshold? Then we could use that function in our splitters.
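The shared helper proposed here, which walks a user's preferences and keeps the ones at or above the threshold sorted by value, might look roughly like this. A sketch with a stand-in `Pref` type rather than Mahout's PreferenceArray:

```java
import java.util.Comparator;
import java.util.List;

public class RelevantPrefsSketch {

  record Pref(long itemID, float value) {}

  // Keep preferences at or above the threshold, highest-rated first.
  static List<Pref> relevantSorted(List<Pref> prefs, float threshold) {
    return prefs.stream()
        .filter(p -> p.value() >= threshold)
        .sorted(Comparator.comparingDouble(Pref::value).reversed())
        .toList();
  }

  public static void main(String[] args) {
    List<Pref> prefs = List.of(
        new Pref(10, 2.0f), new Pref(11, 5.0f), new Pref(12, 4.0f));
    List<Pref> relevant = relevantSorted(prefs, 3.0f);
    System.out.println(relevant.get(0).itemID() + " " + relevant.size()); // 11 2
  }
}
```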

        Sean Owen added a comment -

        After looking at this more, I'm not sure this is the right refactoring. The time-based implementation is parameterized by four times. Shouldn't it be one: before that time is training data, after it is test? It still splits on a relevance threshold; shouldn't it just be based on time? Splitting this logic out, so that the relevance-threshold business only exists in one implementation, will take more surgery than this. For example, evaluate() shouldn't have a relevance threshold parameter anymore, since that would be a property of one particular splitter strategy, specified in the constructor. The resulting time-based implementation will then not be quite so much copy and paste, which is good, as the copying is a symptom of the logic not being split up completely. I think that would make the change worth committing.

        While I can and have changed it locally, you'll want to watch the spacing, formatting, and ensure the copyright headers are in place. Also the implementations should live in .impl.eval, not .eval.

        It would be better still to also refactor the test/training data split as well, not just relevant items. Consider that a bonus.

        Sean Owen added a comment -

        Oh I see, this actually doesn't implement test/training splitting, which is the other thing I thought should be separated.
        Well, why don't I have a crack at fixing up the formatting and such and adding that, and then committing the whole lot.

        Sean Owen added a comment -

        Do you mind if I end up splitting these interfaces? The name "IRStatsEvaluationDataSplitter" still just describes one thing it does and not the other.

        Anatoliy Kats added a comment -

        Yeah, I was thinking about that too. The reason I decided to keep them together is that they work together to create something like a training set, and something like a testing set. Maybe we should just call it IRStatsEvaluationDataSplitter?

        Sean Owen added a comment -

        OK the problem I'm still having with it, which is a small matter of organization, is that this plucks out two unrelated pieces of code to refactor and places them into one interface with two methods. For example it's called "RelevantItemsDataSplitter" when that only describes 1 of the 2 roles it plays.

        Can this be two interfaces, since these are two different roles conceptually? For example the "processOneUser()" ought to be in a different interface and have a more descriptive name I'd imagine.

        If you can please use 2 spaces instead of tab for formatting.

        Otherwise this is pretty uncontroversial, just pulling out some code and leaving a hook in its place, which seems just fine.

        Anatoliy Kats added a comment -

        This includes the split by time. I haven't added unit tests. However, I picked a date range, and separated the entries in that range from my data using awk. Then I ran an evaluation algorithm on the entire data set and on this smaller file, both times using the date range of the smaller file. The results were identical, after I changed GenericDataModel to sort the output of getPreferencesFromUser().

        Anatoliy Kats added a comment -

        Not yet, just a refactoring so far. Still working on it.

        Sean Owen added a comment -

        Sorry did not mean to make you re-upload, just checking for completeness.
        This patch doesn't have any new logic for time-based splitting though?

        Sean Owen added a comment -

        (The patch isn't marked for inclusion in the project – I assume you intended it for commit.)

        Anatoliy Kats added a comment -

        Here it is.

        Sean Owen added a comment -

        OK shall I wait for a complete patch?

        Anatoliy Kats added a comment -

        Also, I feel like my class names are unnecessarily long. Feel free to edit before committing.

        Anatoliy Kats added a comment -

        Ah, I see what you mean. That would be neat indeed, because we could unify the split for both classes of evaluators. My concern, though, is that it's hard for me to imagine a case where I could use the same splitting strategy for both tests – IRStatsTest currently requires that we test on one user at a time, for instance. In general my feeling is that the nature of the tests requires different splitting strategies. I'd feel more comfortable keeping things the way they are for the time being while we get some experience with the different splitting strategies that our users might need. Afterwards, if it becomes apparent your proposal is in our users' interest, we can deprecate the API in my patch, and write a wrapper class that takes my IRDataSplitter and converts it to your splitter.

        Sean Owen added a comment -

        This is a sketch of what I had in mind. It is lacking the implementation but suggests how one method might be used for both. I could be overlooking some detail of why this wouldn't work. The configuration params would re-appear in implementations.

        Sean Owen added a comment -

        In both cases you have data to put into a model and some held out. For IR you're holding out just a few items, from one user, but it still fits that description. The input is prefs and the output is two subsets of those prefs.

        The benefit would be implementing a strategy once, not twice, for the two interface methods. I would envision an interface with one method like...

        split(DataModel dataModel, FastByIDMap<PreferenceArray> trainingData, FastByIDMap<PreferenceArray> testData);

        The two tests would do different things with those results, yes. It's going to be 10 lines of change not 1.

        It's not a big deal just seems easier on you.
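The one-method interface Sean sketches fills two out-parameter maps rather than returning a result. A toy version, substituting plain Java maps for Mahout's DataModel and FastByIDMap, with a cutoff-time strategy as the example implementation (all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnifiedSplitSketch {

  record Pref(long itemID, float value, long timestamp) {}

  // Mirrors split(dataModel, trainingData, testData): the strategy's
  // parameters (here, a cutoff time) live in the implementation,
  // not in evaluate().
  interface DataSplitter {
    void split(Map<Long, List<Pref>> data,
               Map<Long, List<Pref>> trainingData,
               Map<Long, List<Pref>> testData);
  }

  static DataSplitter byTime(long cutoff) {
    return (data, training, test) -> {
      for (Map.Entry<Long, List<Pref>> e : data.entrySet()) {
        for (Pref p : e.getValue()) {
          Map<Long, List<Pref>> target = p.timestamp() < cutoff ? training : test;
          target.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(p);
        }
      }
    };
  }

  public static void main(String[] args) {
    Map<Long, List<Pref>> data = Map.of(
        1L, List.of(new Pref(10, 4.0f, 1L), new Pref(11, 5.0f, 9L)));
    Map<Long, List<Pref>> training = new HashMap<>();
    Map<Long, List<Pref>> test = new HashMap<>();
    byTime(5L).split(data, training, test);
    System.out.println(training.get(1L).size() + " " + test.get(1L).size()); // 1 1
  }
}
```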

        Anatoliy Kats added a comment -

        I don't quite see how. We are not exactly splitting into training and test sets under the one-user testing paradigm. We have a split for the current user, and we have a possibly different method for choosing which preferences will make it into the training model. So, it's not a matter of a split on a parameter, like relevance threshold. I kind of see how this is a possibility, but I don't fully see the purpose or motivation, so I can't propose a specific API. What exactly do you want to see, and why?

        Sean Owen added a comment -

        Yes that's a good start. Do you think it's possible and desirable to use one single split method for both the estimate-based and IR test? That would be tidy.

        Anatoliy Kats added a comment -

        Passes GenericRecommenderIRStatsEvaluatorImplTest, didn't run the entire suite.

        Sean Owen added a comment -

        OK, sounds like you want to replace more logic, but that's not much harder. Even in the IR test, it's still a question of separating data into test and training data. Hopefully you can then re-use the same test-training split logic for both objects. You do not need to modify any Recommender here.

        Anatoliy Kats added a comment -

        OK, I think I got it (again). We need to factor the relevant ID computation and processOtherUser out of GenericRecommenderIRStatsEvaluator. Then we could pass in a class that knows the split date, and split the current and other users according to that.

        Anatoliy Kats added a comment -

        You've actually convinced me to change the estimate test. Changing relevantItemIDs will not do what I am trying to do. Suppose John is the current user, and I restrict his relevant items to the preferences he expressed before Dec 5th. I can remove them, sure, but then I'll be building a model based on all other users' preferences: I'd be using Mary's preferences from Dec 10th to predict John's on Dec 5th. That misses the entire point, that user preferences are dynamic and change over time. The production algorithm will not be able to predict backwards, so our test environment should not allow that either. That's why I want all the data before a fixed time to be my training set, and the data for the following time period to be my test set.

        Which recommender do you think I should modify to make that happen?

        Sean Owen added a comment -

        No I think it's as simple as factoring out this section of code in GenericRecommenderIRStatsEvaluator, that's all:

        FastIDSet relevantItemIDs = new FastIDSet(at);

        // List some most-preferred items that would count as (most) "relevant" results
        double theRelevanceThreshold = Double.isNaN(relevanceThreshold) ? computeThreshold(prefs) : relevanceThreshold;

        prefs.sortByValueReversed();

        for (int i = 0; i < size && relevantItemIDs.size() < at; i++) {
          if (prefs.getValue(i) >= theRelevanceThreshold) {
            relevantItemIDs.add(prefs.getItemID(i));
          }
        }

        I had thought you were not changing the estimate-based test? But there, it is a matter of factoring out what splitOneUsersPrefs() does.

        Anatoliy Kats added a comment -

        I have to head out, let me ask you a question before I do. Based on what I saw so far, it seems that I need to factor out AbstractDifferenceRecommenderEvaluator::call() to check for relevant items instead of simply estimating the preference. If that's indeed the case, how do you feel about that?

        Anatoliy Kats added a comment -

        I hope this test gives a more realistic result, because preferences sometimes change with time, and this test is in support of recommendation algorithms that can take advantage of these temporal dynamics.

        I think all I need to do is make splitOneUsersPrefs protected instead of private, and then extend the AbstractDifferenceRecommenderEvaluator. I'm going to get started now.

        Sean Owen added a comment -

        Sure, you can do that. I am not sure that gives you a better result than anything other split, but at least it is no worse. I'm reluctant to add a hook that just lets someone create an invalid test (with ratings) but if that can be worked around (throw an exception if the model has no pref values?) then that's mitigated.

        This would really be much more useful in the estimate-based evaluation though, not here. I thought that's what this was about.

        I still have no understanding of why this refactoring involves adding new eval classes or moving or renaming them. It's just factoring out a piece of logic that splits a set into two subsets. It ought not be more complex than that.

        Anatoliy Kats added a comment -

        You're right, if you have preference values, picking relevant items by recency does not work. But, if you want to test for temporal dynamics, here is a test that works for boolean and rated data:

        1. Let days 1...(i-1) be your training data
        2. Let day i be your test data
        3. Calculate an evaluation metric of the test on a recommender created by the training data.

        For rated data, you use the difference metric and that's already implemented. For boolean data, you use the IR metric that I proposed earlier: For each user, generate as many recommendations as there are in the test set. Compute IR statistics based on the intersection of actual preferences, and generated recommendations.

        The easiest way to do that seems to be factoring out the data splitting from the AbstractDifferenceRecommenderEvaluator, which should be renamed, if we go ahead, to reflect its new role.
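For the boolean case, the proposed IR metric reduces to a set intersection between the held-out preferences and the generated recommendations; when the recommendation list is as long as the test set, precision and recall coincide. A sketch of the precision half (a hypothetical helper, plain Java):

```java
import java.util.HashSet;
import java.util.Set;

public class BooleanIRSketch {

  // Precision = |recommended ∩ actual| / |recommended|.
  static double precision(Set<Long> recommended, Set<Long> actualTestPrefs) {
    Set<Long> hit = new HashSet<>(recommended);
    hit.retainAll(actualTestPrefs);
    return recommended.isEmpty() ? 0.0 : (double) hit.size() / recommended.size();
  }

  public static void main(String[] args) {
    // User's held-out day-i prefs vs. the recommender's top picks.
    Set<Long> actual = Set.of(1L, 2L, 3L);
    Set<Long> recs = Set.of(1L, 2L, 5L, 6L);
    System.out.println(precision(recs, actual)); // 0.5
  }
}
```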

        Sean Owen added a comment -

        For the IR precision/recall evaluation, if you do have preference values, then picking the "relevant" items by recency doesn't work. It's not true that these are the best answers that the recommendation can give. (It's not even quite true that the highest-rated items are the best recommendations!) Imagine you just rated several movies you hate. You would not want to score a recommender more highly because it recommends these.

        If you don't have preference values, then picking the relevant items is arbitrary. Picking by time is as good as randomly picking. But, that means you're not expecting to gain anything by splitting by time. Any selection is about equivalent.

        So I think this doesn't help your use of the IR test, and creates a bad test for other use cases. I do think it could be meaningful for the estimation test.

        Even if I thought it were a good test, I don't see the need for a new class. This is merely extracting how "relevantIDs" is computed. Yes, you can refactor that as you say, but then how does anything else change?

        Anatoliy Kats added a comment -

        Or maybe we can somehow split the AbstractDifferenceRecommenderEvaluator into an AbstractDataSplitEvaluator and a TestSetEvaluationMetric with a public Object computeEvaluation() ...?

        Anatoliy Kats added a comment -

        I guess it's a hybrid of some sort between estimation and your implementation of the IR tests. You're right, I used the term test data to what you call relevant data, and I called the rest training data. The estimation test does not work for me because at present I am working with boolean preferences. So, taking into account what you just said, I propose a PrefSplitIRStatsEvaluator that does the following:

        1. Split the data into training and testing sets, with the splitting logic factored out into a separate class. This class should probably ensure that only relevant data goes into the test set. I have very little experience with non-boolean data, so I'll follow your directions about where to put the relevance check.
        2. Build a model on the training data
        3. Generate the same number of preferences for each user as in the testing data
        4. Compute IR statistics on the intersection of actual and predicted preferences.

        The temporal evaluator I need will then loop over calls to this class and aggregate statistics across runs. I will certainly implement it for my project, and I do not mind contributing it to Mahout.
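
        Step 4 of the proposal — IR statistics over the intersection of actual and recommended items — boils down to a hit count. A minimal sketch (plain Java, not the Mahout IRStatistics API; note that when the recommendation list has the same size as the user's test set, as in step 3, precision and recall coincide):

        ```java
        import java.util.List;
        import java.util.Set;

        public class IntersectionStats {

            /**
             * Precision = |recommended ∩ relevant| / |recommended|.
             * With |recommended| == |relevant|, recall equals precision.
             */
            static double precision(List<Long> recommended, Set<Long> relevant) {
                if (recommended.isEmpty()) {
                    return 0.0;
                }
                int hits = 0;
                for (long itemID : recommended) {
                    if (relevant.contains(itemID)) {
                        hits++;
                    }
                }
                return (double) hits / recommended.size();
            }
        }
        ```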

        Sean Owen added a comment -

        OK. I think we're speaking about the estimation test, not the IR tests. In the IR test there is not really a notion of training and test data; there are the relevant items and non-relevant items. The 'relevant' items are the ones held out. You could hold out the latest prefs, I guess, though I wonder if this compromises the meaning of the result. It is not necessarily "bad", for example, if the recommender doesn't consider those latest prefs the top recs. That is not what any implementation is trying to do.

        Sorting isn't needed, but it is probably the easiest way to split the data into training and test data. I don't know if it will be much slower than alternatives, and if it's not, fine for eval purposes. TopN is an existing class. It will be faster at picking out the "most recent" prefs for you but I don't know of an easy way to reuse it to also give you the rest of the older objects efficiently. So, I suppose I'd start with a sort, which is probably 10 lines of code, and see if it's fast enough.

        I do not see a need for any new evaluator, no. The point here is to factor out the test/training split logic only, and with that pluggable, you should be able to create test/training splits based on time. No?

        Anatoliy Kats added a comment -

        The IR tests make recommendations for one user at a time, true, but they build a model based on all other users to make a recommendation for the one. So, as we try to recover each preference P, we build a model based on all users, and all preferences expressed earlier than time(P).

        You're right, sorting is not necessary, because usually it's assumed that preferences stay constant during some time period, say, a day. Is there an existing TopN class you are referring to, or should I write my own?

        I am thinking I need to write a brand new evaluator and make the existing GenericRecommenderIRStatsEvaluator its subclass, rather than the other way around. The reason is that the outer loop of a temporal evaluator class is over the time range of preferences, and only then over the users like GenericRecommenderIRStatsEvaluator. It's natural to see the generic evaluator as a special case of the temporal one, with one pass over the outer loop. What do you think?

        So, I'd write a loop like this:
        for i in 1...N:
        let training data be the bottom (i/N * 100)% of preferences by time.
        let testing data be the slice between (i/N)*100% and ((i+1)/N)*100%.
        (Alternatively, split by a time period, s.t. let days 1...i be training, and i+1 be testing)
        Generate the same number of preferences for each user as in the testing data
        Compute IR statistics on the intersection of actual and predicted preferences.

        How does that sound?
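
        The fraction-based windows in the loop above reduce to computing boundary timestamps over the sorted timeline. A minimal sketch, assuming distinct timestamps (plain Java over raw times; `sliceBoundaries` is an illustrative helper, not Mahout API):

        ```java
        import java.util.Arrays;

        public class WindowCutoffs {

            /**
             * Returns n-1 boundary timestamps: for slice i (1-based),
             * everything strictly below boundaries[i-1] is the bottom i/n of
             * the preferences by time (training), and the next 1/n slice
             * starts at boundaries[i-1] (testing). Requires times.length >= n.
             */
            static long[] sliceBoundaries(long[] times, int n) {
                long[] sorted = times.clone();
                Arrays.sort(sorted);
                long[] boundaries = new long[n - 1];
                for (int i = 1; i < n; i++) {
                    boundaries[i - 1] = sorted[i * sorted.length / n];
                }
                return boundaries;
            }
        }
        ```

        Tied timestamps would need a tie-breaking rule; the per-day variant in the comment sidesteps this by bucketing times into days.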

        Sean Owen added a comment -

        Are we talking about the IR tests, estimation test or both? The IR tests operate on one user's info at a time. The estimation test does operate on all prefs at once.

        For either case, I think using the TopN class to select your top fraction of test data works reasonably well. I suppose I'm only wondering how you then get the training data, since you need to remove the test data from it... Hmm. Maybe just providing a sort works fine then. Your Comparable can use getPreferenceTime(); see if that's reasonably fast.

        Anatoliy Kats added a comment -

        We will be sorting preferences by time for ALL users. The reason is that for each preference, we will need to train a recommender on all earlier preferences, like this:

        1. Sort the top "at" preferences for a user
        2. for the i'th pref in sorted-prefs
        a. generate a training model for all users using earlier preferences
        b. generate i recommendations
        c. increment intersection if one of the recommendations matches the ith actual pref.
        3. Calculate IR statistics as before.

        Is this the correct logic? Is it enough to run an O(prefs^2) sort for reasonable-size datasets, or should we pre-sort preferences by time? If we should sort, I'll do it using Comparable at first, but really we need to call a radix sort of some kind.
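
        On the sorting question: a single O(n log n) comparison sort by timestamp is already a large improvement over the O(prefs^2) scan, and at evaluation-scale data sizes a radix sort is unlikely to be necessary. A minimal sketch of the pre-sort, with an illustrative `Pref` class standing in for a Preference-plus-time wrapper (not a Mahout class):

        ```java
        import java.util.Arrays;
        import java.util.Comparator;

        // Illustrative timestamped preference; not Mahout's Preference interface.
        final class Pref {
            final long itemID;
            final long time;

            Pref(long itemID, long time) {
                this.itemID = itemID;
                this.time = time;
            }
        }

        public class SortByTime {

            // Returns a copy sorted oldest-first by timestamp, leaving the
            // original array untouched.
            static Pref[] sorted(Pref[] prefs) {
                Pref[] copy = prefs.clone();
                Arrays.sort(copy, Comparator.comparingLong((Pref p) -> p.time));
                return copy;
            }
        }
        ```

        With the prefs sorted once, "all preferences earlier than time(P)" is just a prefix of the array, so the per-preference replay no longer needs its own scan.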

        Sean Owen added a comment -

        Yes, the lightest-touch approach is to pull them out into an array of Preference-plus-time objects, and sort. DataModel does not need a sort by time method. At best you could add getTime() to Preference, which returns 0 by default in current implementations, and then create a new subclass of GenericPreference with time. It is sortable then with Comparable, or you could add sortByTime() to PreferenceArray.

        O(pref^2) is not a problem when there are maybe 100 prefs for a user, and in an eval framework. But if the task is just to split them into objects before some time, and after some time, then you would not be doing anything with single prefs. Instead of sorting, using the TopN class to select the top, say, 5% by time is not hard, and will make O(prefs) calls to getPreferenceTime(), which is efficient. No wrapper objects and such needed.

        You won't necessarily know when the data is sorted, without explicit time info. For example they don't appear in time order in a file, and anyway that ordering is ignored at reading.
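
        The TopN idea described above — picking the most recent fraction in one pass over getPreferenceTime() values — can be sketched with a size-bounded min-heap (plain Java over raw timestamps; this is the general technique, not the actual Mahout TopItems API):

        ```java
        import java.util.ArrayList;
        import java.util.List;
        import java.util.PriorityQueue;

        public class RecentPrefs {

            /**
             * Returns the k most recent timestamps (k >= 1) in a single pass,
             * keeping a min-heap of size k whose root is the oldest retained
             * time. Result order is unspecified; sort it if order matters.
             */
            static List<Long> mostRecent(long[] times, int k) {
                PriorityQueue<Long> heap = new PriorityQueue<>();
                for (long t : times) {
                    if (heap.size() < k) {
                        heap.add(t);
                    } else if (t > heap.peek()) {
                        heap.poll();  // evict the oldest retained time
                        heap.add(t);
                    }
                }
                return new ArrayList<>(heap);
            }
        }
        ```

        As noted in the comment, this cheaply yields the test slice but not the remaining (older) training prefs, which is why a plain sort may be the simpler starting point.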

        Anatoliy Kats added a comment -

        I am beginning to write an evaluator that takes the time of each preference into account. It seems the first thing I need is to sort all the preferences by time, and I don't quite see how to do that without tearing into Mahout's logic. The DataModel interface only makes a contract that we can pull one preference time at a time, with getPreferenceTime.

        I see three options. First, I can pull the preferences out one at a time into a data structure that is sortable inside the evaluator class. Second, I could add a sortByTimePreference method directly to the DataModel interface; that is a lot more involved, and I can only do it in close collaboration with you. The third option is to avoid sorting altogether: for every test user, and for every one of their preferences, loop over all the other preferences and add only those made earlier to the training set. This option is fastest to code but slowest to run, at O((#prefs)^2). For the testing alone, it works for me, so that's what I'll probably do.

        Here is another compromise option: data is normally accumulated by servers in the order preferences are made, so the input is already approximately sorted. If Mahout can preserve the order in which it read the data off the disk, we have sorted data without ever having to sort. In fact, maybe it already does, and all we have to do is make that part of the contract. Is this kind of sorting necessary for a time-based algorithm itself, or only for the evaluation? If it's the former, perhaps we can look into implementing this.

        Anatoliy Kats added a comment -

        A time-based recommender would be great! However, I think we should take it one step at a time. Start with the evaluation, then implement the algorithms themselves.

        I thought about what it takes to implement an evaluator for a time-based algorithm. As a bare minimum, we need FileDataModel to read the time and date information and store it in the Preference class. In addition, there may be other considerations that determine whether an item is eligible to be in the test set. In our case, there is the user's browsing history that led them to an item. We can store whatever we need in additional columns of the input data file. FileDataModel ought to be able to read it, and be easily extendable into a class that processes this extra data. These modifications are tricky for a newcomer. I can give it a try, but since I do not have the big-picture view, I may mess up something important. I'll start on the obvious modifications while I wait for your comments.
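
        For illustration, reading an optional timestamp (and tolerating extra columns) from a comma-separated line could look like the sketch below. The `ParsedPref` holder, `LineParse` class, and column layout are assumptions for this example, not FileDataModel's implementation; check its actual format before relying on it:

        ```java
        // Illustrative holder for one parsed line; not a Mahout class.
        final class ParsedPref {
            final long userID;
            final long itemID;
            final float value;
            final long timestamp;  // 0 when the line carries no timestamp

            ParsedPref(long userID, long itemID, float value, long timestamp) {
                this.userID = userID;
                this.itemID = itemID;
                this.value = value;
                this.timestamp = timestamp;
            }
        }

        public class LineParse {

            /**
             * Parses "userID,itemID[,pref[,timestamp[,extra...]]]". A missing
             * pref defaults to 1.0 (boolean data); columns past the fourth are
             * ignored here but could carry data such as browsing history.
             */
            static ParsedPref parse(String line) {
                String[] f = line.split(",");
                long userID = Long.parseLong(f[0].trim());
                long itemID = Long.parseLong(f[1].trim());
                float value = f.length > 2 ? Float.parseFloat(f[2].trim()) : 1.0f;
                long timestamp = f.length > 3 ? Long.parseLong(f[3].trim()) : 0L;
                return new ParsedPref(userID, itemID, value, timestamp);
            }
        }
        ```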

        Manuel Blechschmidt added a comment -

        Actually it would be a good idea to implement time based splitting. Normally we want a recommender to predict ratings for items that we are going to like in the future and this should be the evaluation basis for the recommendations.

        In an ecommerce scenario you want the recommender to predict the item that you are going to buy next. Therefore you have to hide the newest items.

        The University of Hildesheim (Steffen Rendle, Christoph Freudenthaler, Lars Schmidt-Thieme) published a 2010 paper combining SVD and HMM that outperforms a standard recommender:
        http://www.ismll.uni-hildesheim.de/pub/pdfs/RendleFreudenthaler2010-FPMC.pdf

        Anatoliy Kats added a comment -

        OK. I'll wait for a day to see if anyone has other use cases for me to consider, then propose a design or start working on a patch on Monday.

        Sean Owen added a comment -

        Yes, I think a clean refactoring of this logic, so that the random selection is one of several pluggable strategies, is just fine. You can attach a patch here.


          People

          • Assignee:
            Sean Owen
          • Reporter:
            Anatoliy Kats
          • Votes:
            1
          • Watchers:
            0

          Time Tracking

          • Original Estimate: 48h
          • Remaining Estimate: 48h
          • Time Spent: Not Specified