There might be cases where it makes sense to look beyond the co-ratings. For example, imagine you have three products: A, B and C.
Let's say the pairs A,B and A,C have the same co-ratings (the same users bought them), but B is a topseller that is bought by lots of people, while C is a niche product that sells only rarely.
A cosine that includes the zero assumption would lower the similarity for the topseller and so favor the niche product, which might be a good thing depending on your use case.
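To make the A, B, C example concrete, here is a minimal sketch in plain Java. It is not Mahout code: the class name, the helper methods and the made-up data (every purchase counted as a rating of 1.0, users identified by small integers) are assumptions chosen purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not part of Mahout.
final class CosineZeroAssumptionSketch {

  /** Builds a sparse item vector in which each listed user gave the rating 1.0. */
  static Map<Integer, Double> ratedBy(int... userIds) {
    Map<Integer, Double> vector = new HashMap<>();
    for (int userId : userIds) {
      vector.put(userId, 1.0);
    }
    return vector;
  }

  /** Cosine computed only over users who rated both items. */
  static double cosineCoRatedOnly(Map<Integer, Double> x, Map<Integer, Double> y) {
    double dot = 0.0;
    double squaredNormX = 0.0;
    double squaredNormY = 0.0;
    for (Map.Entry<Integer, Double> entry : x.entrySet()) {
      Double other = y.get(entry.getKey());
      if (other != null) {
        dot += entry.getValue() * other;
        squaredNormX += entry.getValue() * entry.getValue();
        squaredNormY += other * other;
      }
    }
    if (squaredNormX == 0.0 || squaredNormY == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(squaredNormX) * Math.sqrt(squaredNormY));
  }

  /** Cosine over the full vectors, treating every missing rating as 0. */
  static double cosineZeroAssuming(Map<Integer, Double> x, Map<Integer, Double> y) {
    double dot = 0.0;
    for (Map.Entry<Integer, Double> entry : x.entrySet()) {
      Double other = y.get(entry.getKey());
      if (other != null) {
        dot += entry.getValue() * other;
      }
    }
    double normX = norm(x);
    double normY = norm(y);
    if (normX == 0.0 || normY == 0.0) {
      return 0.0;
    }
    return dot / (normX * normY);
  }

  /** Euclidean length of a sparse vector. */
  static double norm(Map<Integer, Double> vector) {
    double sumOfSquares = 0.0;
    for (double value : vector.values()) {
      sumOfSquares += value * value;
    }
    return Math.sqrt(sumOfSquares);
  }

  public static void main(String[] args) {
    Map<Integer, Double> a = ratedBy(0, 1);
    Map<Integer, Double> b = ratedBy(0, 1, 2, 3, 4, 5, 6, 7); // topseller
    Map<Integer, Double> c = ratedBy(0, 1);                   // niche product

    System.out.println("co-rated only  A,B: " + cosineCoRatedOnly(a, b));
    System.out.println("co-rated only  A,C: " + cosineCoRatedOnly(a, c));
    System.out.println("zero-assuming  A,B: " + cosineZeroAssuming(a, b));
    System.out.println("zero-assuming  A,C: " + cosineZeroAssuming(a, c));
  }
}
```

With this data the co-rated-only cosine cannot tell B and C apart (both come out at about 1.0), while the zero-assuming variant gives roughly 0.5 for the topseller B and about 1.0 for the niche product C.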
But I definitely see your point that this assumption does not hold in general, and I also think that the distributed version should be modified.
I attached a patch with a first proposal for how this could be handled.
I tried to refactor the similarity computation out of the map-reduce code so that different similarity functions can be plugged in, as long as they follow this scheme:
- in an early stage of the process, the similarity implementation can compute a weight (a single double) for each item-vector
- in the end, it is given all co-ratings and the previously computed weights for each item-pair that has at least one co-rating
That should be sufficient to compute centered pearson-correlation as well as cosine or tanimoto coefficients.
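The two-step scheme above could be sketched roughly as follows. This is a plain-Java illustration only: the interface, the class names and the method signatures are hypothetical and are not the ones from the attached patch or from DistributedSimilarity. A zero-assuming cosine uses the vector length as its weight, and a Tanimoto coefficient uses the number of ratings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the two-phase scheme, not the code from the patch.
final class TwoPhaseSimilaritySketch {

  /** One user's ratings for the two items being compared. */
  static final class CoRating {
    final double x;
    final double y;

    CoRating(double x, double y) {
      this.x = x;
      this.y = y;
    }
  }

  /** Assumed contract: one weight per item-vector, then one call per item-pair. */
  interface Similarity {
    /** Phase 1: reduce a whole item-vector to a single double. */
    double weight(double[] itemRatings);

    /** Phase 2: combine the co-ratings of a pair with the two stored weights. */
    double similarity(List<CoRating> coRatings, double weightX, double weightY);
  }

  /** Cosine with missing ratings treated as 0; the weight is the vector length. */
  static final class ZeroAssumingCosine implements Similarity {
    @Override
    public double weight(double[] itemRatings) {
      double sumOfSquares = 0.0;
      for (double rating : itemRatings) {
        sumOfSquares += rating * rating;
      }
      return Math.sqrt(sumOfSquares);
    }

    @Override
    public double similarity(List<CoRating> coRatings, double weightX, double weightY) {
      double dot = 0.0;
      for (CoRating coRating : coRatings) {
        dot += coRating.x * coRating.y;
      }
      double denominator = weightX * weightY;
      return denominator == 0.0 ? 0.0 : dot / denominator;
    }
  }

  /** Tanimoto coefficient; the weight is the number of ratings for the item. */
  static final class Tanimoto implements Similarity {
    @Override
    public double weight(double[] itemRatings) {
      return itemRatings.length;
    }

    @Override
    public double similarity(List<CoRating> coRatings, double weightX, double weightY) {
      double both = coRatings.size();
      double either = weightX + weightY - both;
      return either == 0.0 ? 0.0 : both / either;
    }
  }

  public static void main(String[] args) {
    double[] itemA = {1.0, 1.0};                               // rated by 2 users
    double[] itemB = {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0}; // rated by 8 users
    List<CoRating> coRatings = Arrays.asList(new CoRating(1.0, 1.0), new CoRating(1.0, 1.0));

    for (Similarity s : Arrays.asList(new ZeroAssumingCosine(), new Tanimoto())) {
      double result = s.similarity(coRatings, s.weight(itemA), s.weight(itemB));
      System.out.println(s.getClass().getSimpleName() + ": " + result);
    }
  }
}
```

A centered Pearson correlation over the co-ratings would fit the same contract without needing the weight at all, since everything it requires can be derived from the co-ratings in the second phase.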
I hope it's understandable what I'm trying to propose here. Taking a look at org.apache.mahout.cf.taste.hadoop.similarity.DistributedSimilarity together with DistributedPearsonCorrelationSimilarity and DistributedUncenteredZeroAssumingCosineSimilarity will hopefully give a clearer picture. These implementations are merely for demonstration purposes; they could be merged with the existing non-distributed implementations if you like the approach described here.
Committed patch #3 with some largely cosmetic style changes