Mahout
  1. Mahout
  2. MAHOUT-902

TanimotoCoefficientSimilarity should return Double.NaN for two items that have zero overlap

    Details

      Description

      org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity should return Double.NaN for two items that have zero overlap. Please also provide a unit-test for this.

      1. MAHOUT-902.patch
        3 kB
        Remi Melisson

        Activity

        Show
        Remi Melisson added a comment - Hi, Maybe, I misunderstand the description but it seems to be the case : (line 77) http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.java?view=markup tested by : (testNoCorrelation) http://svn.apache.org/viewvc/mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarityTest.java?view=markup Correct me if I'm wrong.
        Hide
        Sean Owen added a comment -

        Sebastian is this turned around? yes the non-distributed implementation already has this behavior. I understood from your message that you proposed to make the distributed version return the same thing.

        I think it's as simple as making its similarity computation into:

        return dots == 0 ? Double.NaN : dots / (normA + normB - dots);

        Show
        Sean Owen added a comment - Sebastian is this turned around? yes the non-distributed implementation already has this behavior. I understood from your message that you proposed to make the distributed version return the same thing. I think it's as simple as making its similarity computation into: return dots == 0 ? Double.NaN : dots / (normA + normB - dots);
        Hide
        Sebastian Schelter added a comment -

        @Remi this is for the user-user similarity computation only.

        @Sean Yes, this change is trivial. I thought it would make a nice first ticket for people who want to start contributing to Mahout (that why its marked with MAHOUT_INTRO_CONTRIBUTE)

        Show
        Sebastian Schelter added a comment - @Remi this is for the user-user similarity computation only. @Sean Yes, this change is trivial. I thought it would make a nice first ticket for people who want to start contributing to Mahout (that why its marked with MAHOUT_INTRO_CONTRIBUTE)
        Hide
        Sean Owen added a comment -

        Ah, right it affects the item-item computation in the non-distributed version too. Well I have a patch ready. I'll wait a bit until I post the 'answer'

        Show
        Sean Owen added a comment - Ah, right it affects the item-item computation in the non-distributed version too. Well I have a patch ready. I'll wait a bit until I post the 'answer'
        Hide
        Remi Melisson added a comment -

        A small patch for this issue, with the corresponding test case.
        Hope it's ok.

        Show
        Remi Melisson added a comment - A small patch for this issue, with the corresponding test case. Hope it's ok.
        Hide
        Sean Owen added a comment -

        That's fine, I think we also need to change the distributed code too, which is the one-liner I mentioned above. If there are no objections I'll put it all together and commit.

        Show
        Sean Owen added a comment - That's fine, I think we also need to change the distributed code too, which is the one-liner I mentioned above. If there are no objections I'll put it all together and commit.
        Hide
        Sean Owen added a comment -

        Tests pass, and this looks like a good fix, and no controversy, and has a patch from the interested party, so I just committed.

        Show
        Sean Owen added a comment - Tests pass, and this looks like a good fix, and no controversy, and has a patch from the interested party, so I just committed.
        Hide
        Remi Melisson added a comment -

        cool

        Show
        Remi Melisson added a comment - cool
        Hide
        Sebastian Schelter added a comment -

        We must not change the distributed code. In the distributed case, 0 is equivalent to non-existent and will already be ignored by iterateNonZero().

        Show
        Sebastian Schelter added a comment - We must not change the distributed code. In the distributed case, 0 is equivalent to non-existent and will already be ignored by iterateNonZero().
        Hide
        Sean Owen added a comment -

        Ah I think I misunderstood what was to change from the start. I'll revert that bit.

        Show
        Sean Owen added a comment - Ah I think I misunderstood what was to change from the start. I'll revert that bit.
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1226 (See https://builds.apache.org/job/Mahout-Quality/1226/)
        MAHOUT-902 item similarity is now NaN for no overlap

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1210544
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.java
        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/TanimotoCoefficientSimilarity.java
        • /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarityTest.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1226 (See https://builds.apache.org/job/Mahout-Quality/1226/ ) MAHOUT-902 item similarity is now NaN for no overlap srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1210544 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.java /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/TanimotoCoefficientSimilarity.java /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarityTest.java
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1227 (See https://builds.apache.org/job/Mahout-Quality/1227/)
        MAHOUT-902 oops do not return NaN from distributed item-item simliarity as 0 means 'ignore'

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1210603
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/TanimotoCoefficientSimilarity.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1227 (See https://builds.apache.org/job/Mahout-Quality/1227/ ) MAHOUT-902 oops do not return NaN from distributed item-item simliarity as 0 means 'ignore' srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1210603 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/TanimotoCoefficientSimilarity.java

          People

          • Assignee:
            Sean Owen
            Reporter:
            Sebastian Schelter
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development