Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4
    • Labels:
      None

      Description

      org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only computes the cosine similarity over those components of the vectors where both vectors have a value greater than zero.

      This is inconsistent with the definition of the cosine (correct me if I'm wrong) and is inconsistent with the distributed cosine similarity computation.
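For concreteness, here is a minimal standalone sketch (hypothetical illustration code, not Mahout's actual implementation) contrasting the two variants the issue is about: a cosine restricted to co-rated components, and the textbook cosine that treats missing preferences as zero.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch contrasting the two cosine variants discussed in this issue.
// Preference vectors are sparse maps from itemID to rating.
public class CosineVariants {

    // Cosine over co-rated components only: dimensions where either user
    // has no preference are skipped entirely (current non-distributed behavior).
    public static double coRatedCosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
                normA += e.getValue() * e.getValue();
                normB += other * other;
            }
        }
        // NaN when there are no co-rated components
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Textbook cosine with the "zero assumption": missing preferences count
    // as 0, so the full norms of both vectors appear in the denominator
    // (matching the distributed computation described above).
    public static double zeroAssumingCosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (double v : a.values()) normA += v * v;
        for (double v : b.values()) normB += v * v;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

If both users rate items 1 and 2 identically but one user has rated an extra item, the co-rated cosine is exactly 1.0 while the zero-assuming cosine drops below 1.0 — which is the inconsistency reported here.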

      1. MAHOUT-389-3.patch
        53 kB
        Sebastian Schelter
      2. MAHOUT-389-2.patch
        53 kB
        Sebastian Schelter
      3. MAHOUT-389.patch
        2 kB
        Sebastian Schelter

        Activity

        Sebastian Schelter created issue -
        Sebastian Schelter added a comment -

        Unit test attached.

        Sebastian Schelter made changes -
        Field Original Value New Value
        Attachment MAHOUT-389.patch [ 12443585 ]
        Sean Owen added a comment -

        The issue with that is you are implicitly assuming that missing ratings are 0, and that's not a good assumption. For example, if your ratings are on a scale of 1 to 5, you are assuming that all unknown items are utterly hated.

        It is not ignoring zeroes; it is ignoring missing ratings.

        I think the distributed computation should act the same way for similar reasons.

        Sean Owen added a comment -

        I should add that you could emulate the behavior you may desire in the non-distributed implementation with a combination of the UncenteredCosineSimilarity and a PreferenceInferrer which always infers 0.

        I don't see a simple way to make the distributed version emulate the non-distributed version.

        While I think the non-distributed version has preferable semantics, neither is really "wrong". I suggest leaving this as more of an improvement idea: figure out a way to unify them by modifying the distributed version.

        Sean Owen made changes -
        Issue Type Bug [ 1 ] Improvement [ 4 ]
        Priority Major [ 3 ] Minor [ 4 ]
        Sebastian Schelter added a comment -

        There might be cases where it makes sense to look at more than just co-ratings. For example, imagine you have 3 products: A, B and C.

        Let's say the pairs A,B and A,C have the same co-ratings (the same users bought them), but B is a topseller, which is bought by lots of people and C is a niche product, which only sells rarely.
        A cosine which includes the zero assumption would decrease the value for the topseller and prefer the niche product, which might be a good thing depending on your use case.
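The A/B/C scenario above can be checked numerically. The following hypothetical sketch (illustration code with made-up data, not part of the patch) builds item vectors over users: users 1 and 2 buy all three items, so the co-ratings of (A,B) and (A,C) are identical, but topseller B is additionally bought by users 3 through 10.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo of the topseller-vs-niche effect: item vectors are
// maps from userID to a unit "bought" preference.
public class TopsellerDemo {

    // Zero-assuming cosine: full norms of both vectors in the denominator.
    static double zeroAssumingCosine(Map<Long, Double> x, Map<Long, Double> y) {
        double dot = 0, nx = 0, ny = 0;
        for (double v : x.values()) nx += v * v;
        for (double v : y.values()) ny += v * v;
        for (Map.Entry<Long, Double> e : x.entrySet()) {
            Double o = y.get(e.getKey());
            if (o != null) dot += e.getValue() * o;
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    // Build an item vector with preference 1.0 for each listed user.
    static Map<Long, Double> item(long... users) {
        Map<Long, Double> v = new HashMap<>();
        for (long u : users) v.put(u, 1.0);
        return v;
    }
}
```

With these numbers, sim(A,B) ≈ 0.447 while sim(A,C) = 1.0: the topseller's many extra ratings inflate its norm and push its similarity down, preferring the niche product — even though the co-ratings of both pairs are identical.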

        But I definitely see your point that the assumption generally does not hold, and I also think that the distributed version should be modified.

        I attached a patch with a first proposal for how this could be managed.

        I tried to refactor the similarity computation out of the map-reduce code and to make it possible to implement different similarity functions, which have to follow this scheme:

        • in an early stage of the process, the similarity implementation can compute a weight (a single double) for each item-vector
        • in the end, it is given all co-ratings and the previously computed weights for each item-pair that has at least one co-rating

        That should be sufficient to compute the centered Pearson correlation as well as cosine or Tanimoto coefficients.
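The two-stage scheme described above can be sketched roughly as follows. Note this is a hypothetical interface with illustrative names and signatures, not the actual code in the patch: a per-vector weight is computed early in the job, and the final similarity only sees the co-ratings plus the two precomputed weights.

```java
import java.util.List;

// Hypothetical sketch of the proposed two-stage similarity scheme.
public class TwoStageScheme {

    // One pair of preferences for an item-pair from the same user.
    public static final class CoRating {
        final double prefA, prefB;
        public CoRating(double prefA, double prefB) {
            this.prefA = prefA;
            this.prefB = prefB;
        }
    }

    public interface PairSimilarity {
        /** Stage 1: a single double summarizing one item-vector. */
        double weight(double[] itemVector);
        /** Stage 2: similarity from all co-ratings plus both precomputed weights. */
        double similarity(List<CoRating> coRatings, double weightA, double weightB);
    }

    // The zero-assuming cosine fits this scheme: the weight is the vector's
    // norm, and the dot product needs only the co-ratings.
    public static final PairSimilarity ZERO_ASSUMING_COSINE = new PairSimilarity() {
        @Override public double weight(double[] v) {
            double sum = 0;
            for (double x : v) sum += x * x;
            return Math.sqrt(sum);
        }
        @Override public double similarity(List<CoRating> coRatings, double wA, double wB) {
            double dot = 0;
            for (CoRating r : coRatings) dot += r.prefA * r.prefB;
            return dot / (wA * wB);
        }
    };
}
```

A centered Pearson correlation or a Tanimoto coefficient would fit the same shape with different weight and combination functions.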

        I hope what I'm trying to propose here is understandable; taking a look at org.apache.mahout.cf.taste.hadoop.similarity.DistributedSimilarity together with DistributedPearsonCorrelationSimilarity and DistributedUncenteredZeroAssumingCosineSimilarity will hopefully give a clearer picture. These implementations are merely for demonstration purposes; they could be merged with the already existing non-distributed implementations in case you like the approach described here.

        Sebastian Schelter made changes -
        Attachment MAHOUT-389-2.patch [ 12443707 ]
        Sebastian Schelter added a comment -

        MAHOUT-389-2.patch had some debug code that was not removed; sorry, this patch should be clean.

        Sebastian Schelter made changes -
        Attachment MAHOUT-389-3.patch [ 12443709 ]
        Sean Owen added a comment -

        De-emphasizing common items is often desirable, though achieving it this way may be a case of mixing two things that should be separable. That is, you can more directly de-emphasize with, for example, the provided InverseUserFrequency transformation.

        But doubtless there are times when this version of the computation is better, even best. For example, when your rating is really the number of times a user has viewed an item, then it's absolutely correct to assume missing items are rated 0, because they really are 0, and so adding that information is even better. You could also get this effect with a PreferenceInferrer. (See, I've thought of everything.)

        But I don't really mind adding more variations on the similarity computation, to give people options. As long as it fits in nicely, doesn't really complicate the code or confuse readers, and doesn't have any material performance impact, why not? I will look at the patch a bit later and see what you're up to.

        Sean Owen added a comment -

        I'm happy to commit this since it looks fine, suits your purposes as an active user, and remains reusable. I might make minor changes, like not making CoRating an inner class of an interface, and making it Writable itself rather than having a separate Writable wrapper.

        Sebastian Schelter added a comment -

        I'm glad you like it, refactor it as you wish!

        This patch should make it possible for Mahout users to use their own customized similarity implementation for the distributed computation.

        Sean Owen added a comment -

        Committed patch #3 with some largely cosmetic style changes.

        Sean Owen made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Sean Owen [ srowen ]
        Fix Version/s 0.4 [ 12314396 ]
        Resolution Fixed [ 1 ]
        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Sean Owen
          • Reporter:
            Sebastian Schelter
          • Votes:
            0
          • Watchers:
            0
