# Fun with Distances - Add Distance functions for a variety of things

## Details

• Type: New Feature
• Status: Closed
• Priority: Minor
• Resolution: Fixed
• Affects Version/s: None
• Fix Version/s:
• Component/s: None
• Labels:
None

## Description

There are many distance functions that are useful to have:

1. Great Circle (lat/lon) and other geo distances
2. Euclidean (Vector)
3. Manhattan (Vector)
4. Cosine (Vector)

For the vector ones, the idea is that the fields on a document can be used to determine the vector.

## Attachments

1. SOLR-1302.patch
58 kB
Grant Ingersoll
2. SOLR-1302.patch
41 kB
Grant Ingersoll
3. SOLR-1302.patch
27 kB
Grant Ingersoll

## Activity

Hide
Grant Ingersoll added a comment -

Bulk close for 3.1.0 release

Show
Grant Ingersoll added a comment - Bulk close for 3.1.0 release
Hide
Hoss Man added a comment -

Correcting Fix Version based on CHANGES.txt, see this thread for more details...

Show
Hoss Man added a comment - Correcting Fix Version based on CHANGES.txt, see this thread for more details... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E
Hide
Shalin Shekhar Mangar added a comment -

Done. Committed revision 911153.

Show
Shalin Shekhar Mangar added a comment - Done. Committed revision 911153.
Hide
Grant Ingersoll added a comment -

Yep, good catch Shalin. You want to fix?

Show
Grant Ingersoll added a comment - Yep, good catch Shalin. You want to fix?
Hide
Shalin Shekhar Mangar added a comment -

Looking over the source of DistanceUtils#vectorDistance, it seems like there is a bug with calculating infinite norm:

Existing code:

```for (int i = 0; i < vec1.length; i++) {
result = Math.max(vec1[i], vec2[i]);
}
```

Shouldn't that be:

```for (int i = 0; i < vec1.length; i++) {
result = Math.max(result, Math.max(vec1[i], vec2[i]));
}
```
Show
Shalin Shekhar Mangar added a comment - Looking over the source of DistanceUtils#vectorDistance, it seems like there is a bug with calculating infinite norm: Existing code: for ( int i = 0; i < vec1.length; i++) { result = Math .max(vec1[i], vec2[i]); } Shouldn't that be: for ( int i = 0; i < vec1.length; i++) { result = Math .max(result, Math .max(vec1[i], vec2[i])); }
Hide
Grant Ingersoll added a comment -

Added strdist - string distance, too, which uses the Lucene spell checker distance measures for calculating things like edit distance, etc. between two strings.

Show
Grant Ingersoll added a comment - Added strdist - string distance, too, which uses the Lucene spell checker distance measures for calculating things like edit distance, etc. between two strings.
Hide
Grant Ingersoll added a comment -

I'm going to close this, as I've implemented a bunch of them, with the exception of cosine. If someone wants to do that one, they can open a new issue.

Show
Grant Ingersoll added a comment - I'm going to close this, as I've implemented a bunch of them, with the exception of cosine. If someone wants to do that one, they can open a new issue.
Hide
Grant Ingersoll added a comment -

Adding geohash Haversine distance requires being able to pass in string literals

Show
Grant Ingersoll added a comment - Adding geohash Haversine distance requires being able to pass in string literals
Hide
Grant Ingersoll added a comment -

That being said, my anecdotal tests show them to be pretty fast over ~68K docs. I'm going to load up the entire Open Street Map planet at some point this week and then I can run real tests.

Show
Grant Ingersoll added a comment - That being said, my anecdotal tests show them to be pretty fast over ~68K docs. I'm going to load up the entire Open Street Map planet at some point this week and then I can run real tests.
Hide
Grant Ingersoll added a comment -

Still benchmarking. I think there are two other things that are needed for performance:
1. Filtering (see the frange stuff)
2. Caching of function results w/in a single search for use between filtering, scoring and sorting

Show
Grant Ingersoll added a comment - Still benchmarking. I think there are two other things that are needed for performance: 1. Filtering (see the frange stuff) 2. Caching of function results w/in a single search for use between filtering, scoring and sorting
Hide
Chris Male added a comment -

Hi Grant,

What sort of performance do these functions have for calculating the distances for say 100k docs?

Thanks,

Show
Chris Male added a comment - Hi Grant, What sort of performance do these functions have for calculating the distances for say 100k docs? Thanks,
Hide
Grant Ingersoll added a comment -

Committed revision 836216.

Show
Grant Ingersoll added a comment - Committed revision 836216.
Hide
Grant Ingersoll added a comment -

Wiki updated: http://wiki.apache.org/solr/FunctionQuery with documentation on all of these functions. I think this is ready to go, will likely commit tomorrow.

Show
Grant Ingersoll added a comment - Wiki updated: http://wiki.apache.org/solr/FunctionQuery with documentation on all of these functions. I think this is ready to go, will likely commit tomorrow.
Hide
Grant Ingersoll added a comment -

I believe this is ready to commit, but am going to sleep on it.

Show
Grant Ingersoll added a comment - I believe this is ready to commit, but am going to sleep on it.
Hide
Grant Ingersoll added a comment -

No need for all of these classes for one offs of distance. Implement general Lp_Space Vector distance function + a special case for the squared euclidean distance (which isn't really a distance, but is still useful.)

Euclidean distance is dist(2, valuesources...), while Manhattan is (dist(1, valuesources...). Handles some powers as special cases for improved speed.

This pretty much enables Solr to do some pretty cool stuff when it comes to vector calculations.

See the tests for how to use. Will add Wiki later.

Show
Grant Ingersoll added a comment - No need for all of these classes for one offs of distance. Implement general Lp_Space Vector distance function + a special case for the squared euclidean distance (which isn't really a distance, but is still useful.) See http://en.wikipedia.org/wiki/Lp_space Euclidean distance is dist(2, valuesources...), while Manhattan is (dist(1, valuesources...). Handles some powers as special cases for improved speed. This pretty much enables Solr to do some pretty cool stuff when it comes to vector calculations. See the tests for how to use. Will add Wiki later.
Hide
Grant Ingersoll added a comment -

Adds in SquaredEuclideanFunction and EuclideanFunction distance functions in addition to the others. Both can calculate the Euclidean distance on an n-dimensional vector made up of the fields of a document. See the unit test (DistanceFunctionTest) for an example.

Show
Grant Ingersoll added a comment - Adds in SquaredEuclideanFunction and EuclideanFunction distance functions in addition to the others. Both can calculate the Euclidean distance on an n-dimensional vector made up of the fields of a document. See the unit test (DistanceFunctionTest) for an example.
Hide
Grant Ingersoll added a comment -

Note, we may want to mod this to work more with a Field Type that represents a Point. This depends on SOLR-1131.

Show
Grant Ingersoll added a comment - Note, we may want to mod this to work more with a Field Type that represents a Point. This depends on SOLR-1131 .
Hide
Grant Ingersoll added a comment -

Haversine implementation, RadianFunction and DegreeFunction. Also small refactorings in other places to better support doubles to avoid losing precision for as long as possible.

Next up: Euclidean and SquaredEuclidean

Show
Grant Ingersoll added a comment - Haversine implementation, RadianFunction and DegreeFunction. Also small refactorings in other places to better support doubles to avoid losing precision for as long as possible. Next up: Euclidean and SquaredEuclidean
Hide
Grant Ingersoll added a comment -

I've got Haversine implemented (great circle), and RadianFunction (convert a val to radians). I'll put up a patch shortly.

Show
Grant Ingersoll added a comment - I've got Haversine implemented (great circle), and RadianFunction (convert a val to radians). I'll put up a patch shortly.
Hide
Chris A. Mattmann added a comment -

Hey Grant,

I'd love to help on this issue – do you have a preference regarding which one of the above 4 that you'd like to work on? All of them? 2 of them?

Let me know – I'd be happy to write 1 or 2 of them...

Cheers,
Chris

Show
Chris A. Mattmann added a comment - Hey Grant, I'd love to help on this issue – do you have a preference regarding which one of the above 4 that you'd like to work on? All of them? 2 of them? Let me know – I'd be happy to write 1 or 2 of them... Cheers, Chris

## People

• Assignee:
Grant Ingersoll
Reporter:
Grant Ingersoll