Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, 5.0
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Adding this static function makes it really easy to incorporate distance with the score or other signals in arbitrary ways, e.g. score / (1 + log(distance)) or whatever.

        Activity

        Hide
        Florian Schilling added a comment -

        Hi Robert,

        this is a great patch!!
        Maybe it's possible to set the scale value (TO_KILOMETERS) of the haversin function via argument, so it can also be used with units different from kilometers?

        Show
        Florian Schilling added a comment - Hi Robert, this is a great patch!! Maybe it's possible to set the scale value (TO_KILOMETERS) of the haversin function via argument, so it can also be used with units different from kilometers?
        Hide
        Uwe Schindler added a comment - - edited

        I Robert, hi Ted,
        if I have some time later, I will post another "sloppy" distance function, which is still almost correct (also near the poles) and works very good for distances up to 500 km: "Polar coordinate flat-Earth formula" (see http://en.wikipedia.org/wiki/Geographical_distance#Polar_coordinate_flat-Earth_formula)

        This one only needs one cosinus and a square root, which can be the table lookup and the native sqrt() processor instruction.

        This formula is perfect for scoring. If you score and just multiply someting like 1/distance or 1/ln(distance) to your score, the precision is not really important. We use the above formula for that. By that it is possible to find all places around some coordinate, but still have better matching (similarity wise) results score higher, although they are far more away and vice versa (my favourite example: if user searches for "vegetarian pizza", a pizzaria with the name "veggi pizza place" should score higher than "meat pizza place", although more far away). As our scoring factors are floats and the norms are just 8 bit floats, the distance can be very simplified. In our case, we just needed a good distance also near the poles, so the polar coordinate flat earth formula is perfect (and still very correct, also near poles).

        In addition, if you multiply the score like score/ln(distance), you can remove the sqrt from the formula, too, because ln() makes just a factor out of it! This is perfect for similarity matches combined with distance.

        Show
        Uwe Schindler added a comment - - edited I Robert, hi Ted, if I have some time later, I will post another "sloppy" distance function, which is still almost correct (also near the poles) and works very good for distances up to 500 km: "Polar coordinate flat-Earth formula" (see http://en.wikipedia.org/wiki/Geographical_distance#Polar_coordinate_flat-Earth_formula ) This one only needs one cosinus and a square root, which can be the table lookup and the native sqrt() processor instruction. This formula is perfect for scoring. If you score and just multiply someting like 1/distance or 1/ln(distance) to your score, the precision is not really important. We use the above formula for that. By that it is possible to find all places around some coordinate, but still have better matching (similarity wise) results score higher, although they are far more away and vice versa (my favourite example: if user searches for "vegetarian pizza", a pizzaria with the name "veggi pizza place" should score higher than "meat pizza place", although more far away). As our scoring factors are floats and the norms are just 8 bit floats, the distance can be very simplified. In our case, we just needed a good distance also near the poles, so the polar coordinate flat earth formula is perfect (and still very correct, also near poles). In addition, if you multiply the score like score/ln(distance), you can remove the sqrt from the formula, too, because ln() makes just a factor out of it! This is perfect for similarity matches combined with distance.
        Hide
        Robert Muir added a comment -

        I think that can work. Maybe we can specialize compareBottom() in some way to use it, so that necessary work is only done when the hit stands a chance of being competitive to the top-N...

        Show
        Robert Muir added a comment - I think that can work. Maybe we can specialize compareBottom() in some way to use it, so that necessary work is only done when the hit stands a chance of being competitive to the top-N...
        Hide
        Ted Dunning added a comment -

        A new name suffices as well. For instance: closerThan(double)

        Sent from my iPhone

        Show
        Ted Dunning added a comment - A new name suffices as well. For instance: closerThan(double) Sent from my iPhone
        Hide
        Robert Muir added a comment -

        One possible way is if we change our function list from Map<String,Method> to Map<String,Method[]> so that you can have overloaded functions? This would allow us to allow this function to have an "optional" argument with a cutoff as Ted suggests...

        Show
        Robert Muir added a comment - One possible way is if we change our function list from Map<String,Method> to Map<String,Method[]> so that you can have overloaded functions? This would allow us to allow this function to have an "optional" argument with a cutoff as Ted suggests...
        Hide
        Robert Muir added a comment -

        I'll buy it (or rather, let Uwe argue it, since i know he combines distance with score), but within the context of this functionality its not so trivial to integrate such cutoff optimizations: as the JIRA here is not to make this accessible via java code but instead from the context of an arbitrary javascript-like expression?

        Show
        Robert Muir added a comment - I'll buy it (or rather, let Uwe argue it, since i know he combines distance with score), but within the context of this functionality its not so trivial to integrate such cutoff optimizations: as the JIRA here is not to make this accessible via java code but instead from the context of an arbitrary javascript-like expression?
        Hide
        Ted Dunning added a comment -

        I absolutely agree. I do think its handy to get the real distance back in the TopFieldDocs from this function, and I think honestly this function is often way-overkill especially if you combine with many other scoring factors?

        Well, since sin(epsilon) = epsilon, losing the asin doesn't actually hurt the accuracy much at the scale of hundreds of kilometers. At the scale of thousands of km, what you get is the distance straight through the earth. That is OK in lots of cases, especially where you are sorting results by distance.

        Combining distance with other scores doesn't ever make much sense unless you are using distance as a boolean cutoff. In that case, it makes sense to add a cutoff argument to the distance function and apply the sin to the cutoff instead of applying asin to every distance.

        Show
        Ted Dunning added a comment - I absolutely agree. I do think its handy to get the real distance back in the TopFieldDocs from this function, and I think honestly this function is often way-overkill especially if you combine with many other scoring factors? Well, since sin(epsilon) = epsilon, losing the asin doesn't actually hurt the accuracy much at the scale of hundreds of kilometers. At the scale of thousands of km, what you get is the distance straight through the earth. That is OK in lots of cases, especially where you are sorting results by distance. Combining distance with other scores doesn't ever make much sense unless you are using distance as a boolean cutoff. In that case, it makes sense to add a cutoff argument to the distance function and apply the sin to the cutoff instead of applying asin to every distance.
        Hide
        Robert Muir added a comment -

        I build up a more principled test using caliper and now see the Fast.* methods coming in about 30% faster. Dot method is slower.

        Hi Ted: thank you for spending the time to do the caliper benchmarks!

        It occurs to me that you don't need the final inverse trig function if you only want to rank distances.

        I absolutely agree. I do think its handy to get the real distance back in the TopFieldDocs from this function, and I think honestly this function is often way-overkill especially if you combine with many other scoring factors?

        I know Uwe has some stuff here like optimized flat-earth formulas, I hope he adds it in the future.

        Show
        Robert Muir added a comment - I build up a more principled test using caliper and now see the Fast.* methods coming in about 30% faster. Dot method is slower. Hi Ted: thank you for spending the time to do the caliper benchmarks! It occurs to me that you don't need the final inverse trig function if you only want to rank distances. I absolutely agree. I do think its handy to get the real distance back in the TopFieldDocs from this function, and I think honestly this function is often way-overkill especially if you combine with many other scoring factors? I know Uwe has some stuff here like optimized flat-earth formulas, I hope he adds it in the future.
        Hide
        Ted Dunning added a comment -

        It occurs to me that you don't need the final inverse trig function if you only want to rank distances.

        So if you want speed, peel that baby off. Results for comparison at https://microbenchmarks.appspot.com/runs/43b5e3b9-5d1b-4a6c-9aaf-d4834c82523d

        Show
        Ted Dunning added a comment - It occurs to me that you don't need the final inverse trig function if you only want to rank distances. So if you want speed, peel that baby off. Results for comparison at https://microbenchmarks.appspot.com/runs/43b5e3b9-5d1b-4a6c-9aaf-d4834c82523d
        Hide
        Ted Dunning added a comment -

        All is possible.

        I build up a more principled test using caliper and now see the Fast.* methods coming in about 30% faster. Dot method is slower.

        Benchmark results at https://microbenchmarks.appspot.com/runs/b79313c0-bbc0-45bf-9f96-8ae5ff0bac77

        Code for the benchmark is at https://gist.github.com/tdunning/6881908

        Show
        Ted Dunning added a comment - All is possible. I build up a more principled test using caliper and now see the Fast.* methods coming in about 30% faster. Dot method is slower. Benchmark results at https://microbenchmarks.appspot.com/runs/b79313c0-bbc0-45bf-9f96-8ae5ff0bac77 Code for the benchmark is at https://gist.github.com/tdunning/6881908
        Hide
        Uwe Schindler added a comment -

        I also tested with Math.cos and Math.asin and find that haversin runs almost twice faster if you avoid the fancy "fast" math stuff (at least on my laptop). The actual times are about 1.6 microseconds versus about .85 microseconds. Your mileage will vary on micro-benchmarks, of course.

        Laptop with smaller CPU cache?

        Show
        Uwe Schindler added a comment - I also tested with Math.cos and Math.asin and find that haversin runs almost twice faster if you avoid the fancy "fast" math stuff (at least on my laptop). The actual times are about 1.6 microseconds versus about .85 microseconds. Your mileage will vary on micro-benchmarks, of course. Laptop with smaller CPU cache?
        Hide
        Robert Muir added a comment -

        Can you share the code to your microbenchmark? I didnt write any microbenchmarks: I just use search queries, I definitely see the opposite.

        Show
        Robert Muir added a comment - Can you share the code to your microbenchmark? I didnt write any microbenchmarks: I just use search queries, I definitely see the opposite.
        Hide
        Ted Dunning added a comment -

        So I whipped up a quick test and find that the dot product form seems a bit slower.

        I also tested with Math.cos and Math.asin and find that haversin runs almost twice faster if you avoid the fancy "fast" math stuff (at least on my laptop). The actual times are about 1.6 microseconds versus about .85 microseconds. Your mileage will vary on micro-benchmarks, of course.

        Show
        Ted Dunning added a comment - So I whipped up a quick test and find that the dot product form seems a bit slower. I also tested with Math.cos and Math.asin and find that haversin runs almost twice faster if you avoid the fancy "fast" math stuff (at least on my laptop). The actual times are about 1.6 microseconds versus about .85 microseconds. Your mileage will vary on micro-benchmarks, of course.
        Hide
        Robert Muir added a comment -

        Also, isn't the dot product formulation for distance faster and simpler in the first place? I thought that the point of the spherical geometry formulations was that they allowed direct use of log trig tables in the bad old days of sextants and paper tables.

        Hmm at least with the naive formulation, its no faster at all? acos vs asin or whatever to solve for distance? Unless you have a trick?

        Show
        Robert Muir added a comment - Also, isn't the dot product formulation for distance faster and simpler in the first place? I thought that the point of the spherical geometry formulations was that they allowed direct use of log trig tables in the bad old days of sextants and paper tables. Hmm at least with the naive formulation, its no faster at all? acos vs asin or whatever to solve for distance? Unless you have a trick?
        Hide
        Ted Dunning added a comment -

        Hi Ted: I think you are looking at the logic backwards? it only falls thru for large values.

        Oops. You are right. I read it backwards. It is early in the morning here. (or something)

        Show
        Ted Dunning added a comment - Hi Ted: I think you are looking at the logic backwards? it only falls thru for large values. Oops. You are right. I read it backwards. It is early in the morning here. (or something)
        Hide
        Robert Muir added a comment -

        I know that it is all fashionable to dis on the Java cos and sin functions, but really... this code can never generate an angle greater than pi and yet it uses code that falls through to standard cos for anything smaller than MAX_INT * 1e9.

        Hi Ted: I think you are looking at the logic backwards? it only falls thru for large values.

        Did you test to see if this actually helps with the speed?

        Of course: using thousands of queries.

        Show
        Robert Muir added a comment - I know that it is all fashionable to dis on the Java cos and sin functions, but really... this code can never generate an angle greater than pi and yet it uses code that falls through to standard cos for anything smaller than MAX_INT * 1e9. Hi Ted: I think you are looking at the logic backwards? it only falls thru for large values. Did you test to see if this actually helps with the speed? Of course: using thousands of queries.
        Hide
        Ted Dunning added a comment -

        Also, isn't the dot product formulation for distance faster and simpler in the first place? I thought that the point of the spherical geometry formulations was that they allowed direct use of log trig tables in the bad old days of sextants and paper tables.

        Show
        Ted Dunning added a comment - Also, isn't the dot product formulation for distance faster and simpler in the first place? I thought that the point of the spherical geometry formulations was that they allowed direct use of log trig tables in the bad old days of sextants and paper tables.
        Hide
        Ted Dunning added a comment -

        I know that it is all fashionable to dis on the Java cos and sin functions, but really... this code can never generate an angle greater than pi and yet it uses code that falls through to standard cos for anything smaller than MAX_INT * 1e9.

        Did you test to see if this actually helps with the speed?

        Show
        Ted Dunning added a comment - I know that it is all fashionable to dis on the Java cos and sin functions, but really... this code can never generate an angle greater than pi and yet it uses code that falls through to standard cos for anything smaller than MAX_INT * 1e9. Did you test to see if this actually helps with the speed?
        Hide
        David Smiley added a comment -

        Really cool Rob! It wasn't obvious in the issue description but after looking at your patch, I see that you bring in here some other open-source implementations of cosine and sine that are not quite as accurate but run much faster. I might find a way to use these routines in Lucene-Spatial / Spatial4j.

        Show
        David Smiley added a comment - Really cool Rob! It wasn't obvious in the issue description but after looking at your patch, I see that you bring in here some other open-source implementations of cosine and sine that are not quite as accurate but run much faster. I might find a way to use these routines in Lucene-Spatial / Spatial4j.
        Hide
        ASF subversion and git services added a comment -

        Commit 1529685 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1529685 ]

        LUCENE-5258: add distance function to expressions/

        Show
        ASF subversion and git services added a comment - Commit 1529685 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1529685 ] LUCENE-5258 : add distance function to expressions/
        Hide
        ASF subversion and git services added a comment -

        Commit 1529679 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1529679 ]

        LUCENE-5258: add distance function to expressions/

        Show
        ASF subversion and git services added a comment - Commit 1529679 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1529679 ] LUCENE-5258 : add distance function to expressions/

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development