# Improve Euclidean distance similarity calculation

## Details

• Type: Improvement
• Status: Closed
• Priority: Minor
• Resolution: Fixed
• Affects Version/s: 0.5
• Fix Version/s:
• Component/s:
• Labels:

## Description

In the non-distributed recommender world, the Euclidean distance similarity is calculated as n/(1+d), where d is distance and n is dimension. 1/(1+d) is a valid mapping from distance [0,infinity) to similarity (0,1]. n is there to "correct" for the fact that things are farther apart in higher dimensions. It would be right-er, after some discussion, to use a factor of sqrt, and apply directly to the distance; 1/(1+d/sqrt).

I propose fixing the calculation accordingly.

In the distributed similarity, the formula is 1-1/(1+d), which is the wrong way around. That will be fixed. I'd apply the same heuristic, except that at the moment we don't have access to the value of n at that point. I don't like the inconsistency but it's minor; would rather get this change in now, which definitely improves things.

## Attachments

1. MAHOUT-847.patch
7 kB
Sean Owen

## Activity

Sean Owen created issue -
Field Original Value New Value
Status Open [ 1 ] Patch Available [ 10002 ]
 Attachment MAHOUT-847.patch [ 12499856 ]
Hide
Lance Norskog added a comment -

A instance of this class will probably be called with the same 'n' over and over, so it could cache sqrt.

Show
Lance Norskog added a comment - A instance of this class will probably be called with the same 'n' over and over, so it could cache sqrt .
Hide
Lance Norskog added a comment - - edited

As a perpetual beginner, it is daunting to learn Mahout. It helps if well-defined math terms are only used when the code matches the (200-year-old) concept.

Perhaps WarpedEuclideanSimilarity would be more clear?

Show
Lance Norskog added a comment - - edited As a perpetual beginner, it is daunting to learn Mahout. It helps if well-defined math terms are only used when the code matches the (200-year-old) concept. Perhaps WarpedEuclideanSimilarity would be more clear?
Hide
Sean Owen added a comment -

The problem with caching sqrt is that every pair of users / vectors may overlap in a different number of dimensions. It's not fixed (or if it were, I wouldn't be correcting for it as it wouldn't change ordering).

Well, the real Euclidean distance is in there alright. It is not a similarity metric by itself (it gets bigger with less similarity). So there's a transformation to a similarity value, and that's the made-up bit I'm changing here. I think "EuclideanDistance"+"Similarity" is therefore accurate.

Show
Sean Owen added a comment - The problem with caching sqrt is that every pair of users / vectors may overlap in a different number of dimensions. It's not fixed (or if it were, I wouldn't be correcting for it as it wouldn't change ordering). Well, the real Euclidean distance is in there alright. It is not a similarity metric by itself (it gets bigger with less similarity). So there's a transformation to a similarity value, and that's the made-up bit I'm changing here. I think "EuclideanDistance"+"Similarity" is therefore accurate.
 Status Patch Available [ 10002 ] Resolved [ 5 ] Resolution Fixed [ 1 ]
Hide
Lance Norskog added a comment -

How does this compare with EuclideanDistanceMeasure? Does EDM have the same semantics as the old or new EDSimilarity? If I use one or the other in parallel-structured recommender stacks should I expect roughly the same results? Or, should EDM be rewritten to match?

Show
Lance Norskog added a comment - How does this compare with EuclideanDistanceMeasure? Does EDM have the same semantics as the old or new EDSimilarity? If I use one or the other in parallel-structured recommender stacks should I expect roughly the same results? Or, should EDM be rewritten to match?
Hide
Sean Owen added a comment -

No, EuclideanDistanceMeasure is a distance measure rather than similarity measure. That is, it can be bigger than 1 and (obviously) gets bigger as the Euclidean distance grows. It is not changed, and continues to return the usual Euclidean distance. This is not used in recommenders.

Show
Sean Owen added a comment - No, EuclideanDistanceMeasure is a distance measure rather than similarity measure. That is, it can be bigger than 1 and (obviously) gets bigger as the Euclidean distance grows. It is not changed, and continues to return the usual Euclidean distance. This is not used in recommenders.
 Status Resolved [ 5 ] Closed [ 6 ]

## People

• Assignee:
Sean Owen
Reporter:
Sean Owen
0 Vote for this issue
Watchers:
0 Start watching this issue

## Dates

• Due:
Created:
Updated:
Resolved: