Details

Type: New Feature

Status: Resolved

Priority: Major

Resolution: Fixed

Affects Version/s: None

Component/s: modules/other

Labels:None

Lucene Fields:New, Patch Available
Description
Adding this static function makes it really easy to incorporate distance with the score or other signals in arbitrary ways, e.g. score / (1 + log(distance)) or whatever.

 LUCENE5258.patch
 22 kB
 Robert Muir
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
Commit 1529685 from Robert Muir in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1529685 ]
LUCENE5258: add distance function to expressions/
Really cool Rob! It wasn't obvious in the issue description but after looking at your patch, I see that you bring in here some other opensource implementations of cosine and sine that are not quite as accurate but run much faster. I might find a way to use these routines in LuceneSpatial / Spatial4j.
I know that it is all fashionable to dis on the Java cos and sin functions, but really... this code can never generate an angle greater than pi and yet it uses code that falls through to standard cos for anything smaller than MAX_INT * 1e9.
Did you test to see if this actually helps with the speed?
Also, isn't the dot product formulation for distance faster and simpler in the first place? I thought that the point of the spherical geometry formulations was that they allowed direct use of log trig tables in the bad old days of sextants and paper tables.
I know that it is all fashionable to dis on the Java cos and sin functions, but really... this code can never generate an angle greater than pi and yet it uses code that falls through to standard cos for anything smaller than MAX_INT * 1e9.
Hi Ted: I think you are looking at the logic backwards? it only falls thru for large values.
Did you test to see if this actually helps with the speed?
Of course: using thousands of queries.
Hi Ted: I think you are looking at the logic backwards? it only falls thru for large values.
Oops. You are right. I read it backwards. It is early in the morning here. (or something)
Also, isn't the dot product formulation for distance faster and simpler in the first place? I thought that the point of the spherical geometry formulations was that they allowed direct use of log trig tables in the bad old days of sextants and paper tables.
Hmm at least with the naive formulation, its no faster at all? acos vs asin or whatever to solve for distance? Unless you have a trick?
So I whipped up a quick test and find that the dot product form seems a bit slower.
I also tested with Math.cos and Math.asin and find that haversin runs almost twice faster if you avoid the fancy "fast" math stuff (at least on my laptop). The actual times are about 1.6 microseconds versus about .85 microseconds. Your mileage will vary on microbenchmarks, of course.
Can you share the code to your microbenchmark? I didnt write any microbenchmarks: I just use search queries, I definitely see the opposite.
I also tested with Math.cos and Math.asin and find that haversin runs almost twice faster if you avoid the fancy "fast" math stuff (at least on my laptop). The actual times are about 1.6 microseconds versus about .85 microseconds. Your mileage will vary on microbenchmarks, of course.
Laptop with smaller CPU cache?
All is possible.
I build up a more principled test using caliper and now see the Fast.* methods coming in about 30% faster. Dot method is slower.
Benchmark results at https://microbenchmarks.appspot.com/runs/b79313c0bbc045bf9f968ae5ff0bac77
Code for the benchmark is at https://gist.github.com/tdunning/6881908
It occurs to me that you don't need the final inverse trig function if you only want to rank distances.
So if you want speed, peel that baby off. Results for comparison at https://microbenchmarks.appspot.com/runs/43b5e3b95d1b4a6c9aafd4834c82523d
I build up a more principled test using caliper and now see the Fast.* methods coming in about 30% faster. Dot method is slower.
Hi Ted: thank you for spending the time to do the caliper benchmarks!
It occurs to me that you don't need the final inverse trig function if you only want to rank distances.
I absolutely agree. I do think its handy to get the real distance back in the TopFieldDocs from this function, and I think honestly this function is often wayoverkill especially if you combine with many other scoring factors?
I know Uwe has some stuff here like optimized flatearth formulas, I hope he adds it in the future.
I absolutely agree. I do think its handy to get the real distance back in the TopFieldDocs from this function, and I think honestly this function is often wayoverkill especially if you combine with many other scoring factors?
Well, since sin(epsilon) = epsilon, losing the asin doesn't actually hurt the accuracy much at the scale of hundreds of kilometers. At the scale of thousands of km, what you get is the distance straight through the earth. That is OK in lots of cases, especially where you are sorting results by distance.
Combining distance with other scores doesn't ever make much sense unless you are using distance as a boolean cutoff. In that case, it makes sense to add a cutoff argument to the distance function and apply the sin to the cutoff instead of applying asin to every distance.
I'll buy it (or rather, let Uwe argue it, since i know he combines distance with score), but within the context of this functionality its not so trivial to integrate such cutoff optimizations: as the JIRA here is not to make this accessible via java code but instead from the context of an arbitrary javascriptlike expression?
One possible way is if we change our function list from Map<String,Method> to Map<String,Method[]> so that you can have overloaded functions? This would allow us to allow this function to have an "optional" argument with a cutoff as Ted suggests...
A new name suffices as well. For instance: closerThan(double)
Sent from my iPhone
I think that can work. Maybe we can specialize compareBottom() in some way to use it, so that necessary work is only done when the hit stands a chance of being competitive to the topN...
I Robert, hi Ted,
if I have some time later, I will post another "sloppy" distance function, which is still almost correct (also near the poles) and works very good for distances up to 500 km: "Polar coordinate flatEarth formula" (see http://en.wikipedia.org/wiki/Geographical_distance#Polar_coordinate_flatEarth_formula)
This one only needs one cosinus and a square root, which can be the table lookup and the native sqrt() processor instruction.
This formula is perfect for scoring. If you score and just multiply someting like 1/distance or 1/ln(distance) to your score, the precision is not really important. We use the above formula for that. By that it is possible to find all places around some coordinate, but still have better matching (similarity wise) results score higher, although they are far more away and vice versa (my favourite example: if user searches for "vegetarian pizza", a pizzaria with the name "veggi pizza place" should score higher than "meat pizza place", although more far away). As our scoring factors are floats and the norms are just 8 bit floats, the distance can be very simplified. In our case, we just needed a good distance also near the poles, so the polar coordinate flat earth formula is perfect (and still very correct, also near poles).
In addition, if you multiply the score like score/ln(distance), you can remove the sqrt from the formula, too, because ln() makes just a factor out of it! This is perfect for similarity matches combined with distance.
Hi Robert,
this is a great patch!!
Maybe it's possible to set the scale value (TO_KILOMETERS) of the haversin function via argument, so it can also be used with units different from kilometers?
Commit 1529679 from Robert Muir in branch 'dev/trunk'
[ https://svn.apache.org/r1529679 ]
LUCENE5258: add distance function to expressions/