Mahout
  1. Mahout
  2. MAHOUT-847

Improve Euclidean distance similarity calculation

    Details

      Description

      In the non-distributed recommender world, the Euclidean distance similarity is calculated as n/(1+d), where d is distance and n is dimension. 1/(1+d) is a valid mapping from distance [0,infinity) to similarity (0,1]. n is there to "correct" for the fact that things are farther apart in higher dimensions. It would be right-er, after some discussion, to use a factor of sqrt, and apply directly to the distance; 1/(1+d/sqrt).

      I propose fixing the calculation accordingly.

      In the distributed similarity, the formula is 1-1/(1+d), which is the wrong way around. That will be fixed. I'd apply the same heuristic, except that at the moment we don't have access to the value of n at that point. I don't like the inconsistency but it's minor; would rather get this change in now, which definitely improves things.

        Activity

        Hide
        Lance Norskog added a comment -


        A instance of this class will probably be called with the same 'n' over and over, so it could cache sqrt.

        Show
        Lance Norskog added a comment - A instance of this class will probably be called with the same 'n' over and over, so it could cache sqrt .
        Hide
        Lance Norskog added a comment - - edited

        As a perpetual beginner, it is daunting to learn Mahout. It helps if well-defined math terms are only used when the code matches the (200-year-old) concept.

        Perhaps WarpedEuclideanSimilarity would be more clear?

        Show
        Lance Norskog added a comment - - edited As a perpetual beginner, it is daunting to learn Mahout. It helps if well-defined math terms are only used when the code matches the (200-year-old) concept. Perhaps WarpedEuclideanSimilarity would be more clear?
        Hide
        Sean Owen added a comment -

        The problem with caching sqrt is that every pair of users / vectors may overlap in a different number of dimensions. It's not fixed (or if it were, I wouldn't be correcting for it as it wouldn't change ordering).

        Well, the real Euclidean distance is in there alright. It is not a similarity metric by itself (it gets bigger with less similarity). So there's a transformation to a similarity value, and that's the made-up bit I'm changing here. I think "EuclideanDistance"+"Similarity" is therefore accurate.

        Show
        Sean Owen added a comment - The problem with caching sqrt is that every pair of users / vectors may overlap in a different number of dimensions. It's not fixed (or if it were, I wouldn't be correcting for it as it wouldn't change ordering). Well, the real Euclidean distance is in there alright. It is not a similarity metric by itself (it gets bigger with less similarity). So there's a transformation to a similarity value, and that's the made-up bit I'm changing here. I think "EuclideanDistance"+"Similarity" is therefore accurate.
        Hide
        Lance Norskog added a comment -

        How does this compare with EuclideanDistanceMeasure? Does EDM have the same semantics as the old or new EDSimilarity? If I use one or the other in parallel-structured recommender stacks should I expect roughly the same results? Or, should EDM be rewritten to match?

        Show
        Lance Norskog added a comment - How does this compare with EuclideanDistanceMeasure? Does EDM have the same semantics as the old or new EDSimilarity? If I use one or the other in parallel-structured recommender stacks should I expect roughly the same results? Or, should EDM be rewritten to match?
        Hide
        Sean Owen added a comment -

        No, EuclideanDistanceMeasure is a distance measure rather than similarity measure. That is, it can be bigger than 1 and (obviously) gets bigger as the Euclidean distance grows. It is not changed, and continues to return the usual Euclidean distance. This is not used in recommenders.

        Show
        Sean Owen added a comment - No, EuclideanDistanceMeasure is a distance measure rather than similarity measure. That is, it can be bigger than 1 and (obviously) gets bigger as the Euclidean distance grows. It is not changed, and continues to return the usual Euclidean distance. This is not used in recommenders.

          People

          • Assignee:
            Sean Owen
            Reporter:
            Sean Owen
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Due:
              Created:
              Updated:
              Resolved:

              Development