Solr
  1. Solr
  2. SOLR-1302

Fun with Distances - Add Distance functions for a variety of things

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5, 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      There are many distance functions that are useful to have:

      1. Great Circle (lat/lon) and other geo distances
      2. Euclidean (Vector)
      3. Manhattan (Vector)
      4. Cosine (Vector)

      For the vector ones, the idea is that the fields on a document can be used to determine the vector.

      1. SOLR-1302.patch
        27 kB
        Grant Ingersoll
      2. SOLR-1302.patch
        41 kB
        Grant Ingersoll
      3. SOLR-1302.patch
        58 kB
        Grant Ingersoll

        Issue Links

          Activity

          Hide
          Chris A. Mattmann added a comment -

          Hey Grant,

          I'd love to help on this issue – do you have a preference regarding which one of the above 4 that you'd like to work on? All of them? 2 of them?

          Let me know – I'd be happy to write 1 or 2 of them...

          Cheers,
          Chris

          Show
          Chris A. Mattmann added a comment - Hey Grant, I'd love to help on this issue – do you have a preference regarding which one of the above 4 that you'd like to work on? All of them? 2 of them? Let me know – I'd be happy to write 1 or 2 of them... Cheers, Chris
          Hide
          Grant Ingersoll added a comment -

          I've got Haversine implemented (great circle), and RadianFunction (convert a val to radians). I'll put up a patch shortly.

          Show
          Grant Ingersoll added a comment - I've got Haversine implemented (great circle), and RadianFunction (convert a val to radians). I'll put up a patch shortly.
          Hide
          Grant Ingersoll added a comment -

          Haversine implementation, RadianFunction and DegreeFunction. Also small refactorings in other places to better support doubles to avoid losing precision for as long as possible.

          Next up: Euclidean and SquaredEuclidean

          Show
          Grant Ingersoll added a comment - Haversine implementation, RadianFunction and DegreeFunction. Also small refactorings in other places to better support doubles to avoid losing precision for as long as possible. Next up: Euclidean and SquaredEuclidean
          Hide
          Grant Ingersoll added a comment -

          Note, we may want to mod this to work more with a Field Type that represents a Point. This depends on SOLR-1131.

          Show
          Grant Ingersoll added a comment - Note, we may want to mod this to work more with a Field Type that represents a Point. This depends on SOLR-1131 .
          Hide
          Grant Ingersoll added a comment -

          Adds in SquaredEuclideanFunction and EuclideanFunction distance functions in addition to the others. Both can calculate the Euclidean distance on an n-dimensional vector made up of the fields of a document. See the unit test (DistanceFunctionTest) for an example.

          Show
          Grant Ingersoll added a comment - Adds in SquaredEuclideanFunction and EuclideanFunction distance functions in addition to the others. Both can calculate the Euclidean distance on an n-dimensional vector made up of the fields of a document. See the unit test (DistanceFunctionTest) for an example.
          Hide
          Grant Ingersoll added a comment -

          No need for all of these classes for one offs of distance. Implement general Lp_Space Vector distance function + a special case for the squared euclidean distance (which isn't really a distance, but is still useful.)

          See http://en.wikipedia.org/wiki/Lp_space

          Euclidean distance is dist(2, valuesources...), while Manhattan is (dist(1, valuesources...). Handles some powers as special cases for improved speed.

          This pretty much enables Solr to do some pretty cool stuff when it comes to vector calculations.

          See the tests for how to use. Will add Wiki later.

          Show
          Grant Ingersoll added a comment - No need for all of these classes for one offs of distance. Implement general Lp_Space Vector distance function + a special case for the squared euclidean distance (which isn't really a distance, but is still useful.) See http://en.wikipedia.org/wiki/Lp_space Euclidean distance is dist(2, valuesources...), while Manhattan is (dist(1, valuesources...). Handles some powers as special cases for improved speed. This pretty much enables Solr to do some pretty cool stuff when it comes to vector calculations. See the tests for how to use. Will add Wiki later.
          Hide
          Grant Ingersoll added a comment -

          I believe this is ready to commit, but am going to sleep on it.

          Show
          Grant Ingersoll added a comment - I believe this is ready to commit, but am going to sleep on it.
          Hide
          Grant Ingersoll added a comment -

          Wiki updated: http://wiki.apache.org/solr/FunctionQuery with documentation on all of these functions. I think this is ready to go, will likely commit tomorrow.

          Show
          Grant Ingersoll added a comment - Wiki updated: http://wiki.apache.org/solr/FunctionQuery with documentation on all of these functions. I think this is ready to go, will likely commit tomorrow.
          Hide
          Grant Ingersoll added a comment -

          Committed revision 836216.

          Show
          Grant Ingersoll added a comment - Committed revision 836216.
          Hide
          Chris Male added a comment -

          Hi Grant,

          What sort of performance do these functions have for calculating the distances for say 100k docs?

          Thanks,

          Show
          Chris Male added a comment - Hi Grant, What sort of performance do these functions have for calculating the distances for say 100k docs? Thanks,
          Hide
          Grant Ingersoll added a comment -

          Still benchmarking. I think there are two other things that are needed for performance:
          1. Filtering (see the frange stuff)
          2. Caching of function results w/in a single search for use between filtering, scoring and sorting

          Show
          Grant Ingersoll added a comment - Still benchmarking. I think there are two other things that are needed for performance: 1. Filtering (see the frange stuff) 2. Caching of function results w/in a single search for use between filtering, scoring and sorting
          Hide
          Grant Ingersoll added a comment -

          That being said, my anecdotal tests show them to be pretty fast over ~68K docs. I'm going to load up the entire Open Street Map planet at some point this week and then I can run real tests.

          Show
          Grant Ingersoll added a comment - That being said, my anecdotal tests show them to be pretty fast over ~68K docs. I'm going to load up the entire Open Street Map planet at some point this week and then I can run real tests.
          Hide
          Grant Ingersoll added a comment -

          Adding geohash Haversine distance requires being able to pass in string literals

          Show
          Grant Ingersoll added a comment - Adding geohash Haversine distance requires being able to pass in string literals
          Hide
          Grant Ingersoll added a comment -

          I'm going to close this, as I've implemented a bunch of them, with the exception of cosine. If someone wants to do that one, they can open a new issue.

          Show
          Grant Ingersoll added a comment - I'm going to close this, as I've implemented a bunch of them, with the exception of cosine. If someone wants to do that one, they can open a new issue.
          Hide
          Grant Ingersoll added a comment -

          Added strdist - string distance, too, which uses the Lucene spell checker distance measures for calculating things like edit distance, etc. between two strings.

          Show
          Grant Ingersoll added a comment - Added strdist - string distance, too, which uses the Lucene spell checker distance measures for calculating things like edit distance, etc. between two strings.
          Hide
          Shalin Shekhar Mangar added a comment -

          Looking over the source of DistanceUtils#vectorDistance, it seems like there is a bug with calculating infinite norm:

          Existing code:

          for (int i = 0; i < vec1.length; i++) {
                  result = Math.max(vec1[i], vec2[i]);
          }
          

          Shouldn't that be:

          for (int i = 0; i < vec1.length; i++) {
                  result = Math.max(result, Math.max(vec1[i], vec2[i]));
          }
          
          Show
          Shalin Shekhar Mangar added a comment - Looking over the source of DistanceUtils#vectorDistance, it seems like there is a bug with calculating infinite norm: Existing code: for ( int i = 0; i < vec1.length; i++) { result = Math .max(vec1[i], vec2[i]); } Shouldn't that be: for ( int i = 0; i < vec1.length; i++) { result = Math .max(result, Math .max(vec1[i], vec2[i])); }
          Hide
          Grant Ingersoll added a comment -

          Yep, good catch Shalin. You want to fix?

          Show
          Grant Ingersoll added a comment - Yep, good catch Shalin. You want to fix?
          Hide
          Shalin Shekhar Mangar added a comment -

          Done. Committed revision 911153.

          Show
          Shalin Shekhar Mangar added a comment - Done. Committed revision 911153.
          Hide
          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Show
          Hoss Man added a comment - Correcting Fix Version based on CHANGES.txt, see this thread for more details... http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E
          Hide
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Show
          Grant Ingersoll added a comment - Bulk close for 3.1.0 release

            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Grant Ingersoll
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development