Details

Type: Improvement

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: None

Fix Version/s: 0.4

Component/s: Collaborative Filtering

Labels:None
Description
A summary of the discussion on the mailing list:
Extend the distributed itembased recommender from using only simple cooccurrence counts to using the standard computations of an itembased recommender as defined in Sarwar et al "ItemBased Collaborative Filtering Recommendation Algorithms" (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf).
What the distributed recommender generally does is that it computes the prediction values for all users towards all items those users have not rated yet. And the computation is done in the following way:
u = a user
i = an item not yet rated by u
N = all items cooccurring with i
Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
The formula used in the paper which is used by GenericItemBasedRecommender.doEstimatePreference(...) too, looks very similar to the one above:
u = a user
i = an item not yet rated by u
N = all items similar to i (where similarity is usually computed by pairwisely comparing the itemvectors of the useritem matrix)
Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all n from N: abs(similarity(i,n)))
There are only 2 differences:
a) instead of the cooccurrence count, certain similarity measures like pearson or cosine can be used
b) the resulting value is normalized by the sum of the similarities
To overcome difference a) we would only need to replace the part that computes the cooccurrence matrix with the code from ItemSimilarityJob or the code introduced in MAHOUT418, then we could compute arbitrary similarity matrices and use them in the same way the cooccurrence matrix is currently used. We just need to separate steps up to creating the cooccurrence matrix from the rest, which is simple since they're already different MR jobs.
Regarding difference b) from a first look at the implementation I think it should be possible to transfer the necessary similarity matrix entries from the PartialMultiplyMapper to the AggregateAndRecommendReducer to be able to compute the normalization value in the denominator of the formula. This will take a little work, yes, but is still straightforward. It canbe in the "common" part of the process, done after the similarity matrix is generated.
I think work on this issue should wait until MAHOUT418 is resolved as the implementation here depends on how the pairwise similarities will be computed in the future.

 MAHOUT420.patch
 92 kB
 Sebastian Schelter

 MAHOUT4202.patch
 91 kB
 Sebastian Schelter

 MAHOUT4202a.patch
 78 kB
 Sebastian Schelter

 MAHOUT4203.patch
 91 kB
 Sebastian Schelter
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
Integrated in MahoutQuality #128 (See http://hudson.zones.apache.org/hudson/job/MahoutQuality/128/)
MAHOUT420
I'm going to commit this since I'm convinced enough you've tested it and streamlined it. I have looked over it again briefly and don't see anything that jumps out at me. I'd like to commit even if later we find some additional changes that are required since that way we can continue to collaborate on these changes rather than pass back and forth patches.
Later I'll need to rewrite some of the book chapter on this but that's fine.
It certainly sounds like it's on the right path. I haven't yet had a chance to look at it.
My broad concerns are whether it performs about as well and doesn't work too differently, but it sounds like those are fine.
I will look at the patch within a few days.
I did some local tests using the 100K movielens dataset, generating 10 recommendations per user and having maxPrefsPerUserConsidered set to 25 and maxCooccurrencesPerItemConsidered/maxSimilaritiesPerItemConsidered set to 25.
I checked the overall running time and the amount of data that was read and written in the partialMultiply and aggregateAndRecommend jobs.
The simple cooccurrencebased recommender finished in approximately one minute and read and wrote about 200MB in the partialMultiply and aggregateAndRecommend jobs. All of my patches needed
about 6 minutes and read and wrote 34 times as much data... I finally figured out that that huge difference was caused by me not pruning the vectors as it was done before in the UserVectorToCooccurrenceMapper.
I added that step and evolved the latest patch (the one that uses vectors instead of custom writables).
I got it to finish the job in one minute too and write about 400MB in the partialMultiply and 300MB in the aggregateAndRecommend step when the computation was done using pearson correlation as similarity . I tried to apply all optimizations you mentioned (like setWritesLaxPrecision(true) on the VectorWritables, no multiplication if the pref is 1 and a special computation method for boolean data). I also found a way to make the patch drop recommendations based on only one data point (the same thing GenericItemBasedRecommender.doEstimatePreference(...) is doing).
Are we on the right path and do you see more optimization potential?
I uploaded an alternative patch, that uses vectors instead of custom writables for the computation in the last step. It should be easier to understand and closer to the current implementation. Unfortunately I couldn't figure out how to write a Combiner for the Aggregate step when doing the computation this way and I also didn't find a way to filter out recommendations that were only based on one data point.
Nevertheless I hope it helps us decide which path to go here.
Hi Sean,
I think you're right, optimization is crucial here and there's still a lot of work to do. I will have a look at the things you pointed out. Meanwhile I will try to answer your questions and explain the way the job works best I can. I hope I can explain it good enough so you can join in on this.
I'm not fully understanding is handling of NaN. You see what was done before  NaN values in vectors were used to exclude items from recommendation. It's a reasonably nice way to do it. What's the equivalent here? I see other bits of code paying attention to NaN.
From the example output attached one can see that there's always a NaN summand mapped for the prediction computation towards items a user already knows, so those will be excluded too, as the predicted preference will be NaN too. Unfortunately this happens only in the last computation stage, I didn't find a way to do it earlier (maybe you see one?).
Are we handling "boolean" preferences efficiently? Before it would avoid the vectortimespreference step when the pref was known to be 1.0, and I don't see that now.
I don't think that the current patch works well for boolean preferences. I will need some time to investigate that (or maybe you could give me some hints).
Finally there is a feature in vectors that will save space, causing it to write float values instead of doubles, since we don't need 64 bits of precision. I also don't see how that's preserved.
I think changing the doubles to floats in PredictionPartWritable will have the same effect as the float values in the vectors you mentioned, that will be on my todo list for an updated patch.
Basically I am not yet sure how the new computation is structured from reading the code. I think some comments on the "Aggregate" jobs would be ideal.
I attached the computations of the unittest example (with the combiner disabled for clarity), I hope the output can provide a clear picture of how the computation is done. Please note that this is an unrealistic example as every item is similar to every other item.
A short summary of the changes:
The cooccurrence matrix has been replaced with the similarity matrix and the PartialMultiplyMapper and the AggregateAndRecommendReducer have changed partly.
The PartialMultiplyMapper receives the preferences (userIDs and prefValues) as well as the column from the similarity matrix for an item.
For each preference and similar item it now maps a summand of the numerator and a summand of the denominator (wrapped in a PredictionPartWritable) of the formula for the prediction of the preference of the user towards the similar item:
i = the current item the PartialMultiplyMapper is looking at
u = a user preferring that item
j = an item similar to i (known from the similarity matrix column)
Prediction(u,j) = ... + preference(u,i) * similarity(i,j) + ... / ... + similarity(i,j) + ...
The AggregateAndRecommendReducer receives all PredictionPartWritables for a user (secondary sorted by item) and can thus compute all the predictions for the user.

useritemmatrix burger hotdog berries icecream dog 5 5 2  rabbit 2  3 5 cow  5  3 donkey 3   5 itemitemsimilaritymatrix (tanimotocoefficient of the itemvectors of the useritemmatrix) burger hotdog berries icecream burger  0.25 0.66 0.5 hotdog 0.25  0.33 0.25 berries 0.66 0.33  0.25 icecream 0.5 0.25 0.25  Prediction(dog, icecream) = (0.5 * 5 + 0.25 * 5 + 0.25 * 2 ) / (0.5 + 0.25 + 0.25) ~ 4.3 Prediction(rabbit, hotdog) = (0.25 * 2 + 0.33 * 3 + 0.25 * 5) / (0.25 + 0.33 + 0.25) ~ 3,3 Prediction(cow, burger) = (0.25 * 5 + 0.5 * 3) / (0.25 + 0.5) ~ 3,7 Prediction(cow, berries) = (0.33 * 5 + 0.25 * 3) / (0.33 + 0.25) ~ 4,1 Prediction(donkey, hotdog) = (0.25 * 3 + 0.25 * 5) / (0.25 + 0.25) ~ 4 Prediction(donkey, berries) = (0.66 * 3 + 0.25 * 5) / (0.66 + 0.25) ~ 3,6  PartialMultiplyMapper: looking at item [1] looking at user [1] with preference [5.0] similar item [1] with similarity [NaN] similar item [2] with similarity [0.25] similar item [3] with similarity [0.6666667] similar item [4] with similarity [0.5] looking at user [2] with preference [2.0] similar item [1] with similarity [NaN] similar item [2] with similarity [0.25] similar item [3] with similarity [0.6666667] similar item [4] with similarity [0.5] looking at user [4] with preference [3.0] similar item [1] with similarity [NaN] similar item [2] with similarity [0.25] similar item [3] with similarity [0.6666667] similar item [4] with similarity [0.5] PartialMultiplyMapper: looking at item [2] looking at user [1] with preference [5.0] similar item [1] with similarity [0.25] similar item [2] with similarity [NaN] similar item [3] with similarity [0.33333334] similar item [4] with similarity [0.25] looking at user [3] with preference [5.0] similar item [1] with similarity [0.25] similar item [2] with similarity [NaN] similar item [3] with similarity [0.33333334] similar item [4] with similarity [0.25] PartialMultiplyMapper: looking at item [3] looking at user [1] with preference [2.0] similar item [1] with similarity [0.6666667] similar item [2] with similarity [0.33333334] similar item [3] with similarity [NaN] similar item [4] with similarity [0.25] looking at user [2] with preference [3.0] similar item [1] with similarity [0.6666667] similar item [2] with similarity [0.33333334] similar item [3] with similarity [NaN] similar item [4] with similarity [0.25] PartialMultiplyMapper: looking at item [4] looking at user [2] with preference [5.0] similar item [1] with similarity [0.5] similar item [2] with similarity [0.25] similar item [3] with similarity [0.25] similar item [4] with similarity [NaN] looking at user [3] with preference [3.0] similar item [1] with similarity [0.5] similar item [2] with similarity [0.25] similar item [3] with similarity [0.25] similar item [4] with similarity [NaN] looking at user [4] with preference [5.0] similar item [1] with similarity [0.5] similar item [2] with similarity [0.25] similar item [3] with similarity [0.25] similar item [4] with similarity [NaN] AggregateAndRecommendReducer: Computing predictions for user [1] Predicting preference towards [1] adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator adding preference*similarity [1.3333334] to numerator and similarity [0.6666667] to denominator Predicted preference is [NaN] Predicting preference towards [2] adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator adding preference*similarity [0.6666667] to numerator and similarity [0.33333334] to denominator Predicted preference is [NaN] Predicting preference towards [3] adding preference*similarity [3.3333335] to numerator and similarity [0.6666667] to denominator adding preference*similarity [1.6666667] to numerator and similarity [0.33333334] to denominator adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator Predicted preference is [NaN] Predicting preference towards [4] adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator adding preference*similarity [2.5] to numerator and similarity [0.5] to denominator adding preference*similarity [0.5] to numerator and similarity [0.25] to denominator Predicted preference is [4.25] AggregateAndRecommendReducer: Computing predictions for user [2] Predicting preference towards [1] adding preference*similarity [2.0] to numerator and similarity [0.6666667] to denominator adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator adding preference*similarity [2.5] to numerator and similarity [0.5] to denominator Predicted preference is [NaN] Predicting preference towards [2] adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator adding preference*similarity [0.5] to numerator and similarity [0.25] to denominator adding preference*similarity [1.0] to numerator and similarity [0.33333334] to denominator Predicted preference is [3.3] Predicting preference towards [3] adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator adding preference*similarity [1.3333334] to numerator and similarity [0.6666667] to denominator Predicted preference is [NaN] Predicting preference towards [4] adding preference*similarity [1.0] to numerator and similarity [0.5] to denominator adding preference*similarity [0.75] to numerator and similarity [0.25] to denominator adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator Predicted preference is [NaN] AggregateAndRecommendReducer: Computing predictions for user [3] Predicting preference towards [1] adding preference*similarity [1.5] to numerator and similarity [0.5] to denominator adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator Predicted preference is [3.6666667] Predicting preference towards [2] adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator adding preference*similarity [0.75] to numerator and similarity [0.25] to denominator Predicted preference is [NaN] Predicting preference towards [3] adding preference*similarity [1.6666667] to numerator and similarity [0.33333334] to denominator adding preference*similarity [0.75] to numerator and similarity [0.25] to denominator Predicted preference is [4.142857] Predicting preference towards [4] adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator Predicted preference is [NaN] AggregateAndRecommendReducer: Computing predictions for user [4] Predicting preference towards [1] adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator adding preference*similarity [2.5] to numerator and similarity [0.5] to denominator Predicted preference is [NaN] Predicting preference towards [2] adding preference*similarity [0.75] to numerator and similarity [0.25] to denominator adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator Predicted preference is [4.0] Predicting preference towards [3] adding preference*similarity [1.25] to numerator and similarity [0.25] to denominator adding preference*similarity [2.0] to numerator and similarity [0.6666667] to denominator Predicted preference is [3.5454545] Predicting preference towards [4] adding preference*similarity [1.5] to numerator and similarity [0.5] to denominator adding preference*similarity [NaN] to numerator and similarity [NaN] to denominator Predicted preference is [NaN]
Now that I'm looking at the patch I have a number of question. It seems to be changing many key points of the job, and I wish to see that the functionality and optimizations are not being lost.
I'm not fully understanding is handling of NaN. You see what was done before – NaN values in vectors were used to exclude items from recommendation. It's a reasonably nice way to do it. What's the equivalent here? I see other bits of code paying attention to NaN.
Are we handling "boolean" preferences efficiently? Before it would avoid the vectortimespreference step when the pref was known to be 1.0, and I don't see that now.
Finally there is a feature in vectors that will save space, causing it to write float values instead of doubles, since we don't need 64 bits of precision. I also don't see how that's preserved.
Basically I am not yet sure how the new computation is structured from reading the code. I think some comments on the "Aggregate" jobs would be ideal.
It's also a big task to test but my concern is how fast this runs now. I got to about 700 hours CPU for 5.7 million users / 130M ratings and I'm afraid that it can easily go up by orders of magnitude if some of the optimizations aren't here.
The latest patch should work with the head now.
Are there any chances to reduce the number of unique writable objects we're creating? There is some need to specialize and create custom objects for performance though I do see there are starting to be lots of objects that hold one or two primitives and I'm keen to reuse classes if reasonable
This is certainly desirable, yet is seems very difficult to me, especially when we use features like secondary sort which requires very specialized objects. If you see a good starting point on which objects to generalize I'd be ready to put some work into that.
findDeclaredField() and setField()  yeah I see what you're doing though it seems a little fragile to dig inside an object and change its fields. They are just tests, so maybe it's OK, but are there alternatives? Even for tests, if it's private, I think it's not testable myself.
I can see your point here. In an ideal world you would want to write unittests that know nothing about the actual implementation of the class to test and treat it like a "blackbox", giving it some input and then checking the output. However you have to get the class to test into a certain state before it can be tested in a lot of cases. From my experience the best way to achieve this is to make the class ready for dependency injection so it can be configured from outside, yet I don't think this really fits good for MapReduce code. So I thought the easiest way to get control over the classes state for testing was to directly set the private fields which is a very unobtrusive way because the code does not need to be changed just for testing purposes, yet this approach has the drawback of binding the tests directly to the implementation of the classes that are tested. If we want to avoid that we'd have to refactor the code a bit to be more testable I think.
A rather complex example for this would be the AggregateAndRecommendReducer which fills an OpenIntLongHashMap from a SequenceFile when it is setup. In the test for that class, I did not want to create a sequencefile on disk and have it read that, because that would make the test code unreadable and instead of only testing one method (the reduce() method) I would also implicitly have to test the setup method additionally. So I thought the easiest to write and most understandable way would be to create an OpentIntLongHashMap and directly assign that to the private field. Another solution might be to introduce a packageprivate setter method that could be called by the testcode.
I can try to refactor the code to avoid the setField() calls, if you want.
Likewise I don't mind adding more utility classes per se but I prefer to avoid utils/helper classes if the methods can be reasonably attached to another implementation. I haven't looked hard at it, maybe these are necessary, just noting one concern.
I'm not to fond of utility classes either, especially because they are usually called statically. Yet what I dislike more is code duplication and I tried to only move methods into utility classes that are called from at least two different classes. One example why this is crucial is the parsing of lines from preference text files. I saw that in some places only a comma is allowed as delimiter, while other classes also allow a tab. That's a typical example for what happens when code is duplicated and new functionality is introduced, so I thought it'd be far better to move this functionality into a utility class to have exactly one place in the code where the delimiter is specified.
Does this change behavior of the recommender job or is it the same initial input and final output?
Everything stays the same, the only difference is that you have to tell the job which similarity measure to use.
It's all looking reasonably good. I think the patch may need an update to match head as I am getting errors applying it. i bet they are small issues.
I skimmed through it and have a few questions:
findDeclaredField() and setField() – yeah I see what you're doing though it seems a little fragile to dig inside an object and change its fields. They are just tests, so maybe it's OK, but are there alternatives? Even for tests, if it's private, I think it's not testable myself.
Are there any chances to reduce the number of unique writable objects we're creating? There is some need to specialize and create custom objects for performance though I do see there are starting to be lots of objects that hold one or two primitives and I'm keen to reuse classes if reasonable
Likewise I don't mind adding more utility classes per se but I prefer to avoid utils/helper classes if the methods can be reasonably attached to another implementation. I haven't looked hard at it, maybe these are necessary, just noting one concern.
I'll have to look more at the patch when I can apply it and view it in the IDE.
Does this change behavior of the recommender job or is it the same initial input and final output?
I managed to implement all requirements, added unittests for all mappers and reducers participating in the RecommenderJob and wrote a small integration test to check the correctness of the generated recommendations. I'm not completely sure how the changes affect performance, would be great if someone could review the patch and check that.
I'm pretty content with the latest patch too and I think it fits the current coding style best. Let me know when there are changes to be made, I'm willing to invest as much time as needed here.