Solr / SOLR-2155

Geospatial search using geohash prefixes

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      NOTICE

      The outcome of this issue is a plugin for Solr 3.x, located here: https://github.com/dsmiley/SOLR-2155. See the introductory readme and download the plugin .jar file. Lucene 4's new spatial module is largely based on this code. The Solr 4 glue for it should come very soon, but as of this writing it is hosted temporarily at https://github.com/spatial4j. For more information on using SOLR-2155 with Solr 3, see http://wiki.apache.org/solr/SpatialSearch#SOLR-2155. This JIRA issue is closed because the code won't be committed in its current form.

      There currently isn't a solution in Solr for doing geospatial filtering on documents that have a variable number of points. This scenario occurs when location extraction (e.g. via a "gazetteer") is performed on free text. None, one, or many geospatial locations might be extracted from any given document, and users want to limit their search results to those occurring in a user-specified area.

      I've implemented this by furthering the GeoHash based work in Lucene/Solr with a geohash prefix based filter. A geohash refers to a lat-lon box on the earth. Each successive character added further subdivides the box into a 4x8 (or 8x4 depending on the even/odd length of the geohash) grid. The first step in this scheme is figuring out which geohash grid squares cover the user's search query. I've added various extra methods to GeoHashUtils (and added tests) to assist in this purpose. The next step is an actual Lucene Filter, GeoHashPrefixFilter, that uses these geohash prefixes in TermsEnum.seek() to skip to relevant grid squares in the index. Once a matching geohash grid is found, the points therein are compared against the user's query to see if it matches. I created an abstraction GeoShape extended by subclasses named PointDistance... and CartesianBox.... to support different queried shapes so that the filter need not care about these details.
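The subdivision scheme described above can be sketched with a plain geohash encoder. This is a minimal standalone version for illustration, not the GeoHashUtils code from the patch: each base-32 character packs five alternating longitude/latitude bisections, so a prefix names a grid cell and nearby points share prefixes.

```java
public class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encode a lat/lon point to a geohash of the given length. Bits alternate
    // between longitude and latitude bisections (longitude first); every five
    // bits become one base-32 character, i.e. one more level of grid cells.
    static String encode(double lat, double lon, int length) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // true = split longitude, false = split latitude
        int bit = 0, ch = 0;
        while (hash.length() < length) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
                else            { ch = ch << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
                else            { ch = ch << 1;       latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { hash.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Canonical test vector: (42.6, -5.6) encodes to "ezs42" at length 5.
        System.out.println(encode(42.6, -5.6, 5)); // prints "ezs42"
    }
}
```

Because each added character subdivides the parent cell, a query shape can be covered by a small set of such prefixes, which is what the filter seeks to in the term index.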

      This work was presented at LuceneRevolution in Boston on October 8th.

      1. GeoHashPrefixFilter.patch
        37 kB
        David Smiley
      2. GeoHashPrefixFilter.patch
        49 kB
        David Smiley
      3. GeoHashPrefixFilter.patch
        69 kB
        David Smiley
      4. SOLR.2155.p3.patch
        10 kB
        Bill Bell
      5. SOLR.2155.p3tests.patch
        86 kB
        Bill Bell
      6. SOLR-2155_GeoHashPrefixFilter_with_sorting_no_poly.patch
        177 kB
        David Smiley
      7. Solr2155-1.0.2-project.zip
        95 kB
        David Smiley
      8. Solr2155-for-1.0.2-3.x-port.patch
        5 kB
        Mikhail Khludnev
      9. Solr2155-1.0.3-project.zip
        96 kB
        David Smiley
      10. Solr2155-1.0.4-project.zip
        136 kB
        David Smiley


          Activity

          David Smiley added a comment -

          Kevenz, please ask your question on the Solr-user list. It doesn't pertain to SOLR-2155. I'll look for your question and answer there.

          kevenz added a comment -

          hi David, I'm using solr 4.3, I have indexed docs with a polygon field, and I'd like to search the polygon docs according to the given point.

          I've put the jts-1.13.jar into the WEB-INF/lib directory, and I've added the doc to solr successfully. My question is how to search? I'm new to lucene and solr, any help would be appreciated.

          schema.xml:
          <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
          spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
          distErrPct="0.025"
          maxDistErr="0.000009"
          units="degrees"
          />
          <field name="geo" type="location_rpt" indexed="true" stored="true" multiValued="true" />

          java code:
          String sql = "indexType:219" + " AND " + "geo:Contains(POINT(114.078327401257,22.5424866754136))";
          SolrQuery query = new SolrQuery();
          query.setQuery(sql);
          QueryResponse rsp = server.query(query);
          SolrDocumentList docsList = rsp.getResults();

          Then I got an error at "java.lang.IllegalArgumentException: missing parens: Contains". Is there any suggestion?
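For later readers: the "missing parens: Contains" error most likely comes from the unquoted spatial predicate, which the Lucene query parser splits at the parentheses; note also that WKT separates coordinates with a space, not a comma. A guess at the fix, sketched as a small query-string helper — the field name and the Contains relation are taken from the comment above, and this is untested against that exact setup:

```java
public class SpatialQueryBuilder {
    // Build a Solr 4 spatial clause: the predicate is wrapped in double
    // quotes so the query parser treats it as a single term, and the WKT
    // POINT uses "lon lat" with a space separator.
    static String pointQuery(String field, double lon, double lat) {
        return field + ":\"Contains(POINT(" + lon + " " + lat + "))\"";
    }

    public static void main(String[] args) {
        String q = "indexType:219 AND "
                + pointQuery("geo", 114.078327401257, 22.5424866754136);
        System.out.println(q);
        // The resulting string would then be passed to SolrQuery.setQuery(q).
    }
}
```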

          Sandeep Tucknat added a comment -

          Once again, thanks for the prompt response AND the information! It makes sense, you have a forward index of grid cells to documents, no geo comparisons required at run time. I won't be able to attend the conference but will definitely look forward to your presentation!

          David Smiley added a comment -

          I was just thinking that in order to do the ranking, the filter has to go through all the values of the field and it shouldn't be hard to persist this information and return to the client

          It doesn't; this is a common misconception! It does not calculate the distance between every indexed matched point and the center point of a circle query shape. If it did, it wouldn't be so fast. If you want an implementation like that, then look at LatLonType, which is a brute-force algorithm and hence is not as scalable, and doesn't support multi-value either. To help explain how this can possibly be, understand that there are large grid cells that fit entirely within the query shape; for those cells the index knows which documents are in the cell, so it simply matches those documents without knowing or calculating more precisely where the underlying points actually are. So it's not that the filter code has all this information you want and simply isn't exposing it to you. I need to drive this point home at my next conference presentation at Lucene/Solr Revolution 2013 in May (San Diego, CA).
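The shortcut David describes can be illustrated with a toy prefix check (not the actual GeoHashPrefixFilter code): once the filter knows a cell is wholly inside the query shape, membership is a string-prefix test on the indexed geohash, with no distance math per point. The geohash values below are made up for the example.

```java
import java.util.Set;

public class PrefixMatch {
    // Toy version of the grid-cell shortcut: if a cell (a geohash prefix)
    // lies entirely inside the query shape, every indexed term starting
    // with that prefix matches, with no per-point distance calculation.
    static boolean matchesByPrefix(String indexedGeohash, Set<String> cellsInsideQuery) {
        for (String prefix : cellsInsideQuery) {
            if (indexedGeohash.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical cells found to be fully inside a circle query.
        Set<String> cells = Set.of("dr5r", "dr5x");
        System.out.println(matchesByPrefix("dr5ru7h2", cells)); // true
        System.out.println(matchesByPrefix("9q8yyk8y", cells)); // false
    }
}
```

Only points in cells that straddle the query boundary need an exact geometric check; that is why a precise per-document distance is never materialized for the fully-covered cells.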

          Sandeep Tucknat added a comment -

          First of all, thanks for the prompt response! It feels good to see you are supporting the approach we took in the interim. I was just thinking that in order to do the ranking, the filter has to go through all the values of the field, and it shouldn't be hard to persist this information and return it to the client. We'll be waiting for that optimization to come through! Many thanks! Let me know if I can help in any way (3 weeks in spatial or solr).

          David Smiley added a comment -

          Sujan, Sandeep,
          The filter doesn't ultimately know which, just that the document (business) matched. At the time you display the search results, which is only the top-X (20? 100?), you could then figure out which addresses matched and which is closest. Since this is only done on the limited number of documents you're displaying, it should scale fine. If your docs have many locations, then ideally Solr would have a mechanism to filter the locations outside the query shape out of the multi-valued field so that you needn't do this yourself client-side. That optimization is on my TODO list.

          For Solr 3 use SOLR-2155 (see the banner at the top of this JIRA issue) and for Solr 4, see the "location_rpt" field in the default schema to get started.
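The client-side step described above — computing, for only the displayed page of documents, which stored point is closest to the query center — can be sketched with a plain haversine helper. The method and point layout here are illustrative, not part of SOLR-2155:

```java
public class ClosestLocation {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two lat/lon points (haversine formula).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // For one returned document's stored points ({lat, lon} pairs),
    // pick the one closest to the query center.
    static double[] closest(double[][] points, double qLat, double qLon) {
        double[] best = null;
        double bestDist = Double.MAX_VALUE;
        for (double[] p : points) {
            double d = haversineKm(qLat, qLon, p[0], p[1]);
            if (d < bestDist) { bestDist = d; best = p; }
        }
        return best;
    }
}
```

Running this over just the 20 to 100 documents being displayed is cheap, which is the scaling argument made in the comment above.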

          Sandeep Tucknat added a comment -

          I have a similar requirement to Sujan. I am doing a filtering spatial query, trying to find businesses (with multiple locations stored in a multi-valued field) available within a radius of a given point. I also need to know how many locations are actually within the radius as well as which one is the closest. Wondering if that's possible with the Solr 3 or 4 spatial implementation.

          Sujan added a comment -

          Is there a way to know the address within a document that matches the location search, and not just the document?
          For example, I might have "Store A" with location addresses "100 Main Street, NY, 00001" and "100 NotMainStreet, NY, 00010" which are, say, 40 miles apart. If I search for "00002" and "100 Main Street, NY, 00001" matches, I want the result to indicate that location "100 Main Street, NY" of "Store A" matched. Not sure if that functionality exists.

          David Smiley added a comment -

          Robert, Please ask your question on the solr user list. Your question is primarily about retrieving large results. The spatial aspect seems irrelevant (it at least wasn't in your question). Note that gh_geofilt isn't known by most folks as it's the arbitrary name chosen to register a spatial query parser that only exists in the SOLR-2155 plugin for Solr 3. Solr 4 spatial is quite different.

          Kristopher Davidsohn added a comment -

          Hi,
          I am no longer with the company. Your email has been forwarded to my supervisor for attention. If you need immediate assistance, please call CityGrid’s front desk at 310-360-4500. Thank you!

          Robert Tseng added a comment -

          Hi All,

          New to Solr here! I have a question for you all on gh_geofilt. My document has rows of path, think of KML lineString, of which I want to do bounding box check which of them fall within a box. Each row basically has an id field and a multivalued field describing the line with multiple points.

          What I want returned is all lines that fall within the box, but I read that Solr is not very good, yet, at returning a large number of hits. Hence the rows param to limit results to the top N rows. My two questions are:

          1. If I want to retrieve all rows, do I query twice from SolrJ: once to get the number of hits so I can set a rows value that grabs everything in a second call? Or should I chunk up the query calls using the start param as an offset?

          2. If it's only returning the top N, is it based on score? What is considered a high score? A row with the most points in the box? Closest to the center?

          David Smiley added a comment -
          announcement

          SOLR-3304 was just committed to Solr 4. If you are using SOLR-2155 in Solr 3, then you don't need to worry about patching Solr 4 or anything else to get the same functionality (and more). The field type is SpatialRecursivePrefixTreeFieldType, which defaults to geospatial use with geohashes. It also has a "distErrPct" option that specifies the precision of the shape, which speeds things up some.

          Kristopher Davidsohn added a comment -

          Hi David,
          Thanks for the response. Unfortunately I do already have that line in my solrconfig, and my index is optimized as well... Also for information's sake I'm running solr 3.4

          David Smiley added a comment -

          Kristopher,
          Perhaps you didn't make the changes to solrconfig.xml that you need to make, namely:

          <!-- Optional: replace built-in geodist() with our own modified one for multi-valued geo sort -->
          <valueSourceParser name="geodist" class="solr2155.solr.search.function.distance.HaversineConstFunction$HaversineValueSourceParser" />
          

          This is documented on the solr wiki link at the top of this issue.

          Kristopher Davidsohn added a comment -

          Hi, I have the latest version of this patch, and in this version (and prior ones I've used as well actually) I could not get the distance sorting to work. I have a geohash field set up like so in the schema:

          <fieldType name="geohash" class="solr2155.solr.schema.GeoHashField" length="12" /> 
          <field name="locLatLon_hash" type="geohash" indexed="true" stored="true"/>

          and a sample query I am testing:

          q=(locName%3Ario^5.0)&sort=geodist()+asc&rows=15&start=0&fq={!bbox}&sfield=locLatLon_hash&pt=40.2114%2C-111.6980&d=80.45&qt=standard

          The results come out in no real discernible order (not distance, and not score), and it seems like I may be missing something. Does anyone have any advice or ideas on what might be the issue? Or is distance sorting not supported in this particular case?

          Alexander Kanarsky added a comment -

          Oleg, if you're talking about the ssplex thing, it is simple but stable. You can see how it works on our site (look for custom area search icon, right top corner on Map panel, for example http://www.trulia.com/for_sale/Fremont,CA/My Custom Area__cr_dFpvpgVae%40i]ne%40sjAg%40}H}XvBoFwhApXuAvgAah%40vQnhAmI`G`Jn`%40ehAbyA_sp)
          Please let me know if you have any questions about it.

          Oleg Shevelyov added a comment -

          Hi David! Thanks for your quick response. At the moment I decided to go for JTeam's Spatial Solr Plugin (SSP 2.0) which has a patch with polygon search. The project looks well-documented and it seems faster to apply their stuff than modify 2155 sources. If tests show it doesn't work well, I'll get back to 2155 and let you know. Thank you anyway.

          David Smiley added a comment -

          Hi Oleg. No, it was stripped out a long while ago. But come to think of it, now that this issue isn't going to get committed and the code is hosted outside Apache (it's on GitHub), I could re-introduce the polygon support that was formerly there. It's not a priority for me right now, but if you find the last .patch file on this issue that includes the JTS support (in a comment above I mentioned stripping it out, so grab the version prior to that), then you could resurrect it. There was just one source file, plus a small hook into my query parser. JTS did all the work, really. If you want to try to bring it back, then do so and send me a pull request on GitHub. All said and done, it's a very small amount of work; the integration was done, it just needs to be brought back.

          Oleg Shevelyov added a comment -

          Hi David, does the new 1.0.5 version include polygon search? If not, please could you clarify where to apply the GeoHashPrefixFilter patch? It doesn't apply to the Solr 3.1 sources, nor, obviously, to higher versions. I saw you mentioned that you successfully implemented polygon search, but I still don't get how to make it work. Thanks

          David Smiley added a comment -

          There is a new version, v1.0.5, available on my GitHub repo. Changelog:

          • Fixed bug affecting sorting by distance when the index was not in an optimized state.
          • Norms are omitted automatically now; they aren't used.

          The bug Bill reported is fairly serious and affects anyone doing sorting by this field when the index isn't optimized.

          David Smiley added a comment -

          I updated this issue's leading description info box to point to my GitHub repo for this code and its evolution. It's also where the releases are posted now.

          Bill Bell added a comment -

          sort is definitely not working in all cases using geohash field.

          I am going to compare with other sorts... Maybe we need to extend another class?

          Bill Bell added a comment -

          The distance is calculated properly, but the sort is not working. Probably due to the wrong function being called.

          You need to add a test case to check the sort order coming back from Solr.

          Bill Bell added a comment -

          Ohh, this is interesting: if I switch to sorting with the new geodisth() using sfield=store_lat_lon, I get the same results as when I use sfield=store_geohash (the wrong ones).

          Bill Bell added a comment -

          Could it be if they only have 1 entry in the store_geohash field?

          Bill Bell added a comment -

          I tried SOLR-2155 1.0.3 and got the same results.

          I am able to simplify the issue with a simple query:

          http://localhost:8983/solr/citystateprovider/select?fl=score,display_name,store_lat_lon,store_geohash,city_state&q.alt=*:*&d=100&pt=39.740112,-104.984856&defType=dismax&rows=100&echoParams=all&sfield=store_geohash&sort=geodisth%28%29%20asc&

          And the one that works:

          http://localhost:8983/solr/citystateprovider/select?fl=score,display_name,store_lat_lon,store_geohash,city_state&q.alt=*:*&pt=39.740112,-104.984856&defType=dismax&rows=100&echoParams=all&sfield=store_lat_lon&sort=geodist%28%29%20asc&
          Bill Bell added a comment -

          This is happening on SOLR 3.5 and SOLR 3.6. I did not check any other versions.

          I changed the geodist() to name it geodisth() so I can try the LatLong and the GeoHash versions. They are different.

          I tried dismax and edismax... No difference. This is not the only query that does this. Very consistent.

          Here is the query:

          http://localhost:8983/solr/select?fl=score,display_name,store_lat_lon,store_geohash,city_state&qt=autoproviderdist&d=100&pt=39.740112,-104.984856&defType=dismax&rows=100&q=shawn%20nakamura&echoParams=all&sfield=store_geohash&sort=geodisth%28%29%20asc&

          
            <lst name="params">
              <str name="mm">1</str>
              <str name="d">100</str>
              <str name="sort">geodisth() asc</str>
              <str name="tie">0.01</str>
              <str name="sfield">store_geohash</str>
              <str name="qf">name_edgy name_edge name_word</str>
              <str name="q.alt">*:*</str>
              <str name="group.main">false</str>
              <str name="hl.fl">name_edgy</str>
              <str name="hl">false</str>
              <str name="defType">dismax</str>
              <str name="rows">100</str>
              <str name="echoParams">all</str>
              <str name="fl">score,display_name,store_lat_lon,store_geohash,city_state</str>
              <str name="pt">39.740112,-104.984856</str>
              <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
              <str name="group.field">pwid</str>
              <str name="group">false</str>
              <str name="d">100</str>
              <str name="sort">geodisth() asc</str>
              <str name="sfield">store_geohash</str>
              <str name="group.main">false</str>
              <str name="rows">100</str>
              <str name="defType">dismax</str>
              <str name="echoParams">all</str>
              <str name="debugQuery">true</str>
              <str name="fl">score,display_name,store_lat_lon,store_geohash,city_state</str>
              <str name="q">shawn nakamura</str>
              <str name="pt">39.740112,-104.984856</str>
              <str name="group">false</str>
              <str name="qt">autoproviderdist</str>
            </lst>
          
          

          Results using geohash:

          
            <doc>
              <float name="score">1.7336502</float>
              <str name="city_state">Arvada, CO</str>
              <str name="display_name">Shawn M. XXX</str>
              <arr name="store_geohash">
                <str>39.8184319306165,-105.1404038630426</str>
              </arr>
              <str name="store_lat_lon">39.818432,-105.140404</str>
            </doc>
            <doc>
              <float name="score">1.7336502</float>
              <str name="city_state">Quincy, MA</str>
              <str name="display_name">Shawna XXX</str>
              <arr name="store_geohash">
                <str>42.22851206548512,-71.03219585493207</str>
              </arr>
              <str name="store_lat_lon">42.228512,-71.032196</str>
            </doc>
            <doc>
              <float name="score">1.7336502</float>
              <str name="city_state">Portsmouth, NH</str>
              <str name="display_name">Shawn A. XXX, LSW</str>
              <arr name="store_geohash">
                <str>43.07758695445955,-70.75780486688018</str>
              </arr>
              <str name="store_lat_lon">43.077587,-70.757805</str>
            </doc>
            <doc>
              <float name="score">1.7336502</float>
              <str name="city_state">XXX, HI</str>
              <str name="display_name">Shawn D. XXX, NP</str>
              <arr name="store_geohash">
                <str>21.49900102056563,-158.07000713422894</str>
              </arr>
              <str name="store_lat_lon">21.499001,-158.070007</str>
            </doc>
          

          Results using LatLon:

            <doc>
              <float name="score">1.7336502</float>
              <str name="city_state">Aurora, CO</str>
              <str name="display_name">Shawna L. XXX, RN</str>
              <arr name="store_geohash">
                <str>39.669720036908984,-104.86465500667691</str>
              </arr>
              <str name="store_lat_lon">39.669720,-104.864655</str>
            </doc>
            <doc>
              <float name="score">1.7336502</float>
              <str name="city_state">Arvada, CO</str>
              <str name="display_name">Shawn M. XXX</str>
              <arr name="store_geohash">
                <str>39.8184319306165,-105.1404038630426</str>
              </arr>
              <str name="store_lat_lon">39.818432,-105.140404</str>
            </doc>
            <doc>
              <float name="score">1.7336502</float>
              <str name="city_state">Albuquerque, NM</str>
              <str name="display_name">Shawn C. XXX, PA</str>
              <arr name="store_geohash">
                <str>35.13116103596985,-106.5403369255364</str>
              </arr>
              <str name="store_lat_lon">35.131161,-106.540337</str>
            </doc>
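For reference, the discrepancy can be checked directly: the geohash-sorted results above include points thousands of kilometers from the pt given in the query. A quick haversine sketch (coordinates copied from the results above; 6371 km mean Earth radius):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

pt = (39.740112, -104.984856)       # the pt= parameter in the query above
arvada = (39.818432, -105.140404)   # Arvada, CO result
quincy = (42.228512, -71.032196)    # Quincy, MA result

print(haversine_km(*pt, *arvada))   # well under the d=100 radius
print(haversine_km(*pt, *quincy))   # thousands of km away, yet returned second
```

So the Quincy, Portsmouth, and Hawaii hits cannot belong in a distance-ascending sort from the Denver-area pt, which is the anomaly being reported.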
          
          David Smiley added a comment -

          Weird. Is this reproducible using an index of just these 4 documents (all geo single-valued, apparently) with your query here?

          Bill Bell added a comment -

          I am getting weird results on sort=geodist() asc when using the latest version 1.0.4....

          &sfield=store_geohash&pt=39.740112,-104.984856&sort=geodist() asc

          Notice I am getting Denver, Texas and then Denver again?

          <doc>
          <float name="score">1.526444</float>
          <str name="city_state">Denver, CO</str>
          <str name="display_name">Shawna M. D</str>
          <str name="store_lat_lon">39.740009,-104.992264</str>
          </doc>
          <doc>
          <float name="score">2.9680724</float>
          <str name="city_state">Denver, CO</str>
          <str name="display_name">Meghan F. N, PA</str>
          <str name="store_lat_lon">39.728024,-104.990250</str>
          </doc>
          <doc>
          <float name="score">1.526444</float>
          <str name="city_state">San Antonio, TX</str>
          <str name="display_name">Dr. Shawn K. F, DO</str>
          <str name="store_lat_lon">31.436729,-99.306923</str>
          </doc>
          <doc>
          <float name="score">1.526444</float>
          <str name="city_state">Denver, CO</str>
          <str name="display_name">Dr. Shawn A. N, DO</str>
          <str name="store_lat_lon">39.718670,-104.988907</str>
          </doc>

          David Smiley added a comment -

          I don't recall ever declaring that the range query syntax works. I reviewed the code and I didn't override getRangeQuery() which is required for it to work correctly – instead the parent class's implementation is in effect which unsurprisingly doesn't work. Instead, use my query parser to supply the query box.

          Harley Parks added a comment -

          my assumption is that coordinates are lat, lng
          and, if north is up on the page, the range is from the Lower-Left TO Upper-Right.

          examples of range query:

          q=GeoTagGeoHash:[-15,149 TO -4,165]
          q=GeoTagGeoHash:[19,-160 TO 23,-154]
          q=GeoTagGeoHash:[10,20 TO 30,40]

          Harley Parks added a comment -

          Thanks David. so, i will be sure to add the custom cache.

          running into some issues when using a range query on a multiValue GeoHash.

          sometimes the values are outside of the range provided.

          is that expected? and if so, why?
          Thanks

          David Smiley added a comment -

          Ok, so I got to the bottom of this. When I originally coded the cache entry thing, my intention was to use the same cache as UnInvertedField (multi-value faceting) – THE "fieldValueCache" – since it seemed related. But I did that wrong, and instead it looked up a custom user cache by that same name. The difference is that the official one is configured with <fieldValueCache> and the other is <cache name="fieldValueCache"> (documented in the README). In hindsight, I should have chosen some special name like "solr2155". Now, if you don't configure the cache, then each time you try to sort, it has to rebuild the cache. So basically it's required to add this cache configuration, assuming you're sorting.
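          A minimal solrconfig.xml sketch of the custom user cache described above (this is the plugin's `<cache name="fieldValueCache">` form, not Solr's built-in `<fieldValueCache>` element; the sizes are illustrative):

          ```xml
          <!-- In solrconfig.xml, inside <query>...</query>. This is a *user* cache that
               happens to be named "fieldValueCache" (distinct from Solr's built-in
               <fieldValueCache> element). The size can stay small: there is roughly
               one entry per sorted multi-valued field, not one per document. -->
          <cache name="fieldValueCache"
                 class="solr.FastLRUCache"
                 size="10"
                 initialSize="1"
                 autowarmCount="1"/>
          ```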

          Harley Parks added a comment -

          interesting... conversation regarding both tuning and data requirements.
          I would guess that the size needs to be a power of 2

          since the example given is related to a custom type, it should be noted that a fieldValueCache is by default created, for each document id.

          the example given by default in solr 3.4 if needed:
          <fieldValueCache class="solr.FastLRUCache"
          size="512"
          autowarmCount="128"
          showItems="32" />

          Additionally, the custom cache uses the same name, which might not matter, or does it? the custom cache may override the fieldValueCache created by default.

          Bill Bell added a comment -

          It seems that a max size of 10 would be too small. If we have an average of 3 geohash fields per doc, and we have 2M rows how do we set these caches?

          We will get back to you on when it occurs. We don't commit during the day. Maybe during a gc?

          David Smiley added a comment -

          Bill,
          I don't think 1.0.4 introduced any problem as it was fairly trivial. The INFO log message tells me you are using multi-value sorting which has to put all the values into memory after each commit. Did a commit happen prior to this log message? FYI you should put a warming query into newSearcher for any sorting you do in Solr so that a user never sees the time hit for loading related caches. Mikhail Khludnev suggested the fieldValueCache be explicitly configured but I don't see it being relevant. AFAIK each sorted or faceted multi-valued field gets one entry in that cache.
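          The warming-query suggestion above can be sketched in solrconfig.xml with a standard newSearcher listener (field name and point taken from the queries in this thread; adjust to your handler):

          ```xml
          <!-- In solrconfig.xml: run a sorted query whenever a new searcher opens,
               so the multi-value sort cache is rebuilt before any user query pays
               the load cost. -->
          <listener event="newSearcher" class="solr.QuerySenderListener">
            <arr name="queries">
              <lst>
                <str name="q">*:*</str>
                <str name="sfield">store_geohash</str>
                <str name="pt">39.740112,-104.984856</str>
                <str name="sort">geodist() asc</str>
                <str name="rows">1</str>
              </lst>
            </arr>
          </listener>
          ```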

          Bill Bell added a comment -

          David,

          We are seeing weird slow performance on your new 1.0.4 release.

          INFO: [providersearch] webapp=/solr path=/select params=

          {d=160.9344&facet=false&wt=json&rows=6&start=0&pt=42.6450,-73.7777&facet.field=5star_45&f.5star_45.facet.mincount=1&qt=providersearchspecdist&fq=specialties_ids:(45+)&qq=city_state_lower:"albany,+ny"&f.5star_45.facet.limit=-1}

          hits=960 status=0 QTime=8222

          Hitting that with a slightly different lat long comes back almost instantly. I'm not sure why sometimes they take seconds instead of milliseconds. There is also this log entry a few lines before the long query:

          Mar 22, 2012 11:26:29 AM solr2155.solr.search.function.GeoHashValueSource <init>
          INFO: field 'store_geohash' in RAM: loaded min/avg/max per doc #: (1,1.1089503,11) #2270017

          Are we missing something? Shall we go back to 1.0.3 ?

          Shall I increase the following? What does this actually do?

          <cache name="fieldValueCache"
          class="solr.FastLRUCache" size="10" initialSize="1"
          autowarmCount="1"/>

          Harley Parks added a comment -

          Just doing some testing on the new jar file.
          are there rules on how to structure the bounding box? lower left is south west and upper right is north east?

          using the gh_geohash, it was tricky as its coordinates are flipped: long,lat to get the west, south, east, north box.
          but it works!

          {!gh_geofilt%20sfield=GeoTagGeoHash%20box="129,-16,-180,1"}

          q=:&fq=

          {!geofilt sfield=GeoTagGeoHash pt=60,100 d=1}

          ranged query works too:
          q=GeoTagGeoHash:[-15,149 TO -4,165]
          q=GeoTagGeoHash:[19,-160 TO 23,-154]

          good stuff.
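          The west,south,east,north reading above implies a containment test like the following sketch. Note it also has to handle a box whose west edge is numerically greater than its east edge, i.e. one wrapping across the antimeridian, as box="129,-16,-180,1" does (this is an illustrative reading of the observed behavior, not the plugin's code):

          ```python
          def in_box(lat, lon, west, south, east, north):
              """Point-in-bounding-box test: lower-left (west,south) to upper-right (east,north)."""
              if not (south <= lat <= north):
                  return False
              if west <= east:
                  return west <= lon <= east
              # west > east: the box wraps across the antimeridian (e.g. box="129,-16,-180,1")
              return lon >= west or lon <= east

          # The filter {!gh_geofilt sfield=GeoTagGeoHash box="129,-16,-180,1"}, read as
          # west,south,east,north, should then match a point at lat=-9, lon=160:
          print(in_box(-9.0, 160.0, 129, -16, -180, 1))   # True
          print(in_box(-9.0, 100.0, 129, -16, -180, 1))   # False: lon 100 is outside the wrap
          ```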

          Bill Bell added a comment -

          I already had updated http://wiki.apache.org/solr/SpatialSearchDev with the SOLR-2155 info.

          Harley Parks added a comment -

          Nice! Thanks David, that really helps us out.

          David Smiley added a comment -

          I am attaching Solr2155-1.0.4-project.zip.
          Changes:

          • Fixed a bug in which Solr's XML response showed the field as a geohash instead of lat-lon. This bug was not present for other response formats.
          • Included a pre-built .jar in the zip for convenience. README.txt enhanced a little too.

          And FYI I added some info about SOLR-2155 on Solr's SpatialSearch wiki page.

          As I was looking through the source, I realized I incorrectly once stated in the comments here that the stored value returned from a search would be the same no matter what geohash length you configure. That's not true; you'd have to use another field for the stored value if you want to retain the original precision.
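          The precision point above can be illustrated: a geohash of a given length only approximates the original coordinate, so a field configured with a shorter length cannot reproduce the input exactly. A self-contained sketch of the standard geohash algorithm (not the plugin's code) showing the round-trip error at length 12 versus length 4:

          ```python
          BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

          def encode(lat, lon, precision=12):
              """Interleave longitude/latitude range-halving bits, 5 bits per character."""
              lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
              bits, even = [], True  # even bit positions refine longitude
              while len(bits) < precision * 5:
                  rng, val = (lon_rng, lon) if even else (lat_rng, lat)
                  mid = (rng[0] + rng[1]) / 2
                  if val >= mid:
                      bits.append(1); rng[0] = mid
                  else:
                      bits.append(0); rng[1] = mid
                  even = not even
              chars = []
              for i in range(0, len(bits), 5):
                  idx = 0
                  for b in bits[i:i + 5]:
                      idx = idx * 2 + b
                  chars.append(BASE32[idx])
              return "".join(chars)

          def decode(gh):
              """Return the center (lat, lon) of the geohash cell."""
              lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
              even = True
              for c in gh:
                  idx = BASE32.index(c)
                  for shift in range(4, -1, -1):
                      rng = lon_rng if even else lat_rng
                      mid = (rng[0] + rng[1]) / 2
                      if (idx >> shift) & 1:
                          rng[0] = mid
                      else:
                          rng[1] = mid
                      even = not even
              return ((lat_rng[0] + lat_rng[1]) / 2, (lon_rng[0] + lon_rng[1]) / 2)

          lat, lon = 39.740112, -104.984856
          e12 = decode(encode(lat, lon, 12))  # length 12: sub-meter cells, near-exact round trip
          e4 = decode(encode(lat, lon, 4))    # length 4: cells tens of km wide, visible error
          print(e12, e4)
          ```

          This is why a stored value recovered from the geohash field only matches the input to the configured length's precision; keeping the original in a second field is the way to retain it exactly.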

          Bill Bell added a comment - - edited

          D is in km and radius is in meters. So you would need radius=5000.

          Geofilt does not use point and radius, as far as I can tell.

          Also gh_geofilt does not use pt and d.

          I just don't want people confused.

          Harley Parks added a comment -

          Bill:

          the main advantage is gh_geofilt (or the name in the queryParser setup in solrconfig.xml) can be used as part of the query versus the filter... at least that is what I thought... here are my test strings in solr/admin/form

          it might be helpful to be able to pass parameters in like geodist(sfield,lat,lng).
          gh_geofilt(sfield,lat,lng,d)

          this worked for me in the solr/admin/form:

          !gh_geofilt sfield=GeoTagGeoHash pt=-9,160 d=5

          this also worked for me in the solr/admin/form:

          !gh_geofilt sfield=GeoTagGeoHash point=-9,160 radius=5

          but this also worked:
          !geofilt sfield=GeoTagGeoHash point=-9,160 radius=5
          !geofilt sfield=GeoTagGeoHash pt=-9,160 d=5

          Bill Bell added a comment -

          I did figure out the

          {!gh_geofilt}

          The parameters are point and radius, not pt and d.

          Radius also is in meters not km.

          The performance of this gh_geofilt is almost the same as geofilt. So I'm not sure why you would need it.

          Harley Parks added a comment -

          Sorry for the endless droning.
          But, for now, the plan is to store the latlong and the geohash in two fields.
          search on the geohash, and map on the latlong.
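          A schema.xml sketch of that two-field plan (field and type names follow this thread; LatLonType also needs its *_coordinate dynamic field, omitted here, and both types accept lat,lon input so a copyField can feed one from the other):

          ```xml
          <!-- search/filter on the geohash field, display and map from LatLonType -->
          <fieldType name="geohash" class="solr2155.solr.schema.GeoHashField" length="12"/>
          <fieldType name="latlon" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

          <field name="store_lat_lon" type="latlon"  indexed="true" stored="true"/>
          <field name="store_geohash" type="geohash" indexed="true" stored="false" multiValued="true"/>

          <copyField source="store_lat_lon" dest="store_geohash"/>
          ```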

          Harley Parks added a comment -

          All of the Class Paths in the solr1.0.3 project point to apache solr 3.4 libraries on the apache website... so no action needed, to answer my own question. I'm stumped.

          Harley Parks added a comment -

          Oh.. I may have messed up my build, since I did not include the Solr 3.4 jar files in the class path...
          is there an environment variable that Maven will use? such as CLASSPATH or a lib folder in the project being built?

          Harley Parks added a comment -

          Bill:
          the docs on queryParser state the name of the function can then be used as the main query, gh_geofilt, perhaps something like: /select?q=

          {!gh_geofilt}

          ... but, good question.
          geofilt is working for me on multivalued fields.

          my issue is the query result returns the geohash string, not the geohash lat, long.
          In building the v1.0.3 jar file for solr2155, I used JDK 6; I didn't see any errors, so hopefully that's fine.
          so, I'm going to see if solr 3.5 will perhaps resolve my issue.

          Bill Bell added a comment -

          David,

          What is an example URL call for multiValued field? Does geofilt work?

          /select?q=:&fq=

          {!geofilt}

          &sort=geodist() asc&sfield=store_hash&d=10

          Or do we need to use gh_geofilt? like this?

          /select?q=:&fq=

          {!gh_geofilt}

          &sort=geodist() asc&sfield=store_hash&d=10

          Harley Parks added a comment -

          Sorry about the editing.

          But thanks for the feedback.

          so perhaps there is something wrong with my configuration, since the search results do not return lat, long but the geohash string.

          The maven build went great.

          I did finally figure out how to use the geofilt function using the pt and d.

          I am reindexing each time, but yes, I delete the data folder, and reindex.

          I am placing the jar file into Tomcat's solr/lib folder. After a restart, and after changing solrconfig and schema, the geohash string is displayed, not the lat,long.
          Schema
          Field Type:
          <fieldType name="geohash" class="solr2155.solr.schema.GeoHashField" length="12"/>
          this is the field:
          <field name="GeoTagGeoHash" type="geohash" indexed="true" stored="true" multiValued="true" />

          this is the info from solr/admin/ field types - GEOHASH
          Field Type: geohash
          Fields: GEOTAGGEOHASH
          Tokenized: true
          Class Name: solr2155.solr.schema.GeoHashField
          Index Analyzer: org.apache.solr.analysis.TokenizerChain
          Tokenizer Class: solr2155.solr.schema.GeoHashField$1
          Query Analyzer: org.apache.solr.schema.FieldType$DefaultAnalyzer

          Still, the query returns values like:

          <arr name="GeoTagGeoHash">
          <str>rw3sh9g8c6mx</str>
          <str>rw3f3xc9dnh3</str>
          <str>rw3ckbue74y7</str>
          </arr>

          So, if this is not right, is there anything I can do to troubleshoot?

          David Smiley added a comment -

          Harley,
          You shouldn't need to know a thing about geohashes to use SOLR-2155. You use it identically to LatLonType insofar as you add the data in lat,lon format and get it out the same way in search results, and you can use the built-in geofilt query parser (my gh_geofilt has other options and is optional). Perhaps you are inadvertently using Solr's geohash field? And/or maybe you forgot to reindex?

          p.s. please use the JIRA comment edit feature sparingly; it sends interested parties notifications each time.

          Harley Parks added a comment -

          Wait... there is GeoHashUtils.decode(geohashString)... I wonder if I can create a field in Solr that returns a lat, lng pair?

          Harley Parks added a comment - - edited

          Okay... so, I think I have it working:

          &q=*:*&fq={!geofilt}&sfield=GeoTagGeoHash&pt=21.5,-158&d=0.9

          The only item I need to do now is write a C# conversion from geohash to lat,long.
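For reference, geohash decoding is language-agnostic: take each base-32 character's 5 bits (longitude first, alternating with latitude) and repeatedly bisect the coordinate ranges. A minimal Python sketch of the algorithm, just to illustrate the shape a C# port would take (not part of the plugin):

```python
# Minimal geohash decoder: each character contributes 5 bits that
# alternately refine the longitude and latitude intervals.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def decode(geohash):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # the first bit refines longitude
    for ch in geohash:
        bits = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (bits >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    # return the center of the final cell
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2
```

For example, the well-known test vector "ezs42" decodes to approximately (42.605, -5.603).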

          Harley Parks added a comment - - edited

          For some reason the package solr2155.lucene.spatial.geometry.shape; was misnamed, and there were some other issues with the build... but I'm trying to use Eclipse with a Maven build, and might be missing something else.
          So I downloaded Maven and JDK 6, set up the JAVA_HOME path, added the Maven bin to PATH, unzipped and cd'd to Solr2155-1.0.3-project, executed "mvn package" in a cmd window, and it built nicely.
          Then I added Solr2155-1.0.3.jar to tomcat/solr/lib and followed the readme.txt instructions to update the Solr schema. So now it is working... and the GeoHash field no longer shows a lat,long but a geohash... is that expected?
          example:
          <doc>
          <float name="score">1.0</float>
          <arr name="GeoTagGeoHash">
          <str>87zdk9gyt4kz</str>

          Harley Parks added a comment - - edited

          Fantastic!
          It does get confusing with the different versions, patches, and issues.

          In the light shed here, I can see that I also need to add the plugin, and then the example query should work too.

          Notes on the wiki, at the bottom of http://wiki.apache.org/solr/SpatialSearch
          and in the filter section of http://wiki.apache.org/solr/SpatialSearchDev, may be helpful.

          However, the addendum will be the most helpful by making the above information explicit.

          For example: for Solr 3.4, to add the geohash filter plugin, is it
          <queryParser name="geo" class="solr.SpatialGeoHashFilterQParser$Plugin" />
          or
          <queryParser name="geo" class="solr.SpatialGeohashFilterQParser$Plugin" />
          (readme.txt has: <queryParser name="gh_geofilt" class="solr2155.solr.search.SpatialGeoHashFilterQParser$Plugin" />)
          or something else... geohashfilt? The class is not found in either case.

          Another question on the query: is geohashfilt (or, as indicated in the queryParser name, gh_geofilt) used the same way as geofilt?

          ...wait, I just reread: I would still need to build the jar file from the patch.
          The plugin is not built into Solr 3.x.
          So: build it, drop in the jar file, add in the config, and stir.
          Thank you.

          David Smiley added a comment -

          Harley,
          Apparently I haven't been clear, because this question does come up often, and I sympathize with you all because the comments on this issue are ridiculously long. What I should have done, and still can do, is add info to the Solr wiki. Ever since ~September 2011, you no longer need to patch Solr, and you can use any 3.x release. The specific comment announcing this, with further info, is:
          https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13117350&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13117350
          If you look at the attachments to the issue, you'll notice the latest version is 1.0.3.

          When Solr 3.6 comes out (soon!), my co-author and I will write an online addendum discussing the changes in Solr 3.5 & 3.6 that affect the content of the book, or interesting things we would have written about if we were still writing it. I'll add a clarification to the existing info box on page 144 mentioning SOLR-2155, noting that this feature is available in plugin form for 3.x without patching Solr.

          Harley Parks added a comment - - edited

          So, a basic question... perhaps it needs to be posted elsewhere.
          I'm working with Solr 3.4, using the GeoHash to store multiple locations for a document.
          If geofilt or geodist doesn't work with the GeoHash, is the only way forward to add this patch to Solr 3.4?
          I'm using Tomcat and Solr, and jumping to 4.0 might be a while, even if it's released soon.

          I'm not really clear on how to apply the patch, as I would need to create the solr.war file from source... and compile all of the other sources... painful, but once set up, perhaps rewarding.

          Ideally, I would have a jar file from the patch that I drop into solr/lib, and make the needed changes to the config files.

          So, I'm really interested in getting something stable, and I'm watching the above-mentioned links.

          David Smiley, would you be able to amend your book - Apache Solr 3 ESS, which mentions SOLR-2155 - to include how to implement this patch? Or do I need to get brave and build the source?

          David Smiley added a comment -

          If someone watching this issue has an interest in this capability winding its way into Solr out of the box, then I suggest you vote (and maybe "watch") LUCENE-3795. That issue is the first step, the subsequent step is a follow-on issue that will bring LSP's spatial-solr module which uses spatial-lucene (LUCENE-3795). I don't intend or support committing SOLR-2155 as is. Spatial done-right should involve a good framework; SOLR-2155 isn't a framework and Lucene's existing defunct spatial-contrib module isn't good. That's where LSP comes in, and LUCENE-3795 is the first step to get it incorporated into Lucene/Solr.

          Bill Bell added a comment -

          David,

          We really need to get this into trunk. What is left to do? I would really love to use this, but until it is moved into trunk, most clients won't.

          Maybe we can scale it back a bit, add the feature and iterate?

          Srikanth Kallurkar added a comment -

          Aha! Thanks for the explanation.

          David Smiley added a comment -

          Srikanth,

          1. According to the table on Wikipedia, a geohash length of 8 translates to roughly ±19 meters of error. This means that a filter query may erroneously match points that are <=19 meters outside the query shape, or erroneously exclude points <=19 meters inside it. The accuracy is associated with the indexing of the data, not the query shape.
          2. At query time, if you only do geospatial filters, then there is no RAM requirement for this field. If you sort by distance or let distance influence relevancy, then the indexed points are all brought into memory "un-inverted" so that the center of the query shape can be compared against all indexed points of documents matching the query. LatLonType does the same thing for the same reason, but this code is multiValued-aware, returning the closest indexed point a document has to the query shape. When a commit happens on Solr, this in-memory cache is discarded and garbage collected, and the data is brought into memory again.
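The ±19 m figure falls directly out of the bit budget: each geohash character carries 5 bits, split between longitude and latitude (longitude gets the extra bit at odd lengths). A back-of-the-envelope Python sketch, using ~111,320 m per degree at the equator (an illustrative constant, not from the patch):

```python
def cell_size_degrees(length):
    """(lat, lon) extent in degrees of one geohash cell of the given length."""
    bits = 5 * length
    lon_bits = (bits + 1) // 2  # longitude gets the extra bit at odd lengths
    lat_bits = bits // 2
    return 180.0 / (1 << lat_bits), 360.0 / (1 << lon_bits)

def max_error_meters(length):
    """Half the cell's larger side at the equator, in meters."""
    lat_deg, lon_deg = cell_size_degrees(length)
    return max(lat_deg, lon_deg) / 2 * 111_320
```

Length 8 gives 20 bits per axis, a cell about 0.00034° wide, hence roughly ±19 m; length 12 shrinks that to about ±2 cm.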
          Srikanth Kallurkar added a comment -

          Hi David, apologies for taking a long time to reply. I incorporated your suggestions and did see some speedup: I started with a geohash length of 12 and got a reasonable speedup at 8.
          But I am a bit confused by (1) the accuracy associated with length and (2) the query-time operation in this patch.
          (1) So, is the accuracy associated with how big/small an enclosing box is?
          (2) The entire geohash field is loaded into memory at query time. Is this done because, for the lat-lon comparisons, the patch cannot use Lucene's string-matching mechanisms? Also, because the whole field is in memory, how are updates to the index handled? Meaning, if new lat-lons are added, do they get added to the in-memory geohashes?

          Thanks,
          Srikanth

          David Smiley added a comment -

          Srikanth: If I were you, I would increase ramBufferSizeMB in solrconfig.xml to a good amount – perhaps as much as 256MB. Remember to do this in <mainIndex>, NOT <indexDefaults>. Secondly and most importantly, I would configure the geohash length attribute on the field type to have sufficient search detail for your needs, but no more. Remember, making the geohash shorter, and thus of coarser granularity, doesn't mean you "lose" accuracy in your stored value – which is retained verbatim as you provided it. For a guide on what geohash length to use, consult the Wikipedia page, which has a table. Please let me know how this works out for you.
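Concretely, the two settings mentioned would look roughly like this (the 256MB and length="8" values are illustrative, matching the numbers discussed above):

```xml
<!-- solrconfig.xml: set this inside <mainIndex>, NOT <indexDefaults> -->
<ramBufferSizeMB>256</ramBufferSizeMB>

<!-- schema.xml: a shorter length means coarser grid squares and a smaller
     index; the stored lat,lon value is still retained verbatim -->
<fieldType name="geohash" class="solr2155.solr.schema.GeoHashField"
           length="8"/>
```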

          Srikanth Kallurkar added a comment -

          In my use case, I have a large number of lat-lons for each document - on the order of about 2K lat-lon pairs. Since we started using the geohash prefix filter, indexing time has degraded significantly - by about 2-3 times. Are there any suggestions for speeding up the indexing process? I was trying to read the comments here, but am not sure whether any index-time caching mechanism is used (or could be used) to look up geohashes.

          Thanks,
          Srikanth

          David Smiley added a comment -

          Oliver: Your scenario is interesting, but I wouldn't recommend spatial for it. A key part of spatial is the use of a numerical range; in your case there are discrete values. Instead, I recommend you experiment with phrase queries and, if you are an expert in Lucene, span queries. As a toy hack example, imagine indexing each of these values in the form "senior developer java" (3 words, one for each part). We assume each part tokenizes as one token. Then search for "the developer java", in which "the" is substituted as a kind of wildcard for the first position to find Java developers at all levels of experience. "The" is a stopword and in effect creates a wildcard placeholder. If you search the solr-user list you will find information on this topic. I've solved this problem in a different, more difficult way because my values were not single tokens, but based on the example you present, the solution here isn't bad. If you want to discuss this further, I recommend the solr-user list.
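The positional-matching idea above can be sketched in a few lines of Python, with None standing in for the stopword placeholder (the names and data are illustrative only, not Lucene API):

```python
# Toy model of the stopword-as-wildcard phrase trick: each qualification is
# indexed as a fixed-order token sequence, and a query may leave any
# position unconstrained (None), just as "the" acts as a placeholder in a
# Lucene phrase query with stopwords enabled.
def matches(indexed_tokens, query_tokens):
    if len(indexed_tokens) != len(query_tokens):
        return False
    return all(q is None or q == t
               for t, q in zip(indexed_tokens, query_tokens))

docs = [["senior", "developer", "java"],
        ["junior", "developer", "java"],
        ["senior", "analyst", "sql"]]

# "Java developers at any experience level": wildcard the first position.
java_devs = [d for d in docs if matches(d, [None, "developer", "java"])]
```

Here java_devs keeps the first two documents and drops the analyst.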

          Olivier Jacquet added a comment - - edited

          I just wanted to mention another use case for multivalued point fields, since everyone is always talking about these in a location context.

          The PointType can also be used to categorize other stuff. In my case we're storing qualifications of persons as a tuple of experience, function, and skill (e.g. senior, developer, java), which are internally represented by numerical ids. Now with Solr I would like to be able to do the query "return everything that is a java developer", which would be the same as asking for all points on a certain line.

          David Smiley added a comment -

          The patch file is against Solr trunk (v4). It is a little out of date but it should be easy to port it. Perhaps only some import statements would change due to some moves; I'm not sure. The Solr 4 solution I've been working on with Chris & Ryan is a larger spatial framework called LSP (a temporary name): http://code.google.com/p/lucene-spatial-playground/ Feel free to email me about LSP directly for assistance with it. Development on LSP has slowed but will pick up a lot within a few weeks.

          arin ghazarian added a comment -

          Hi David,
          I am interested in using this geohash prefix filtering/bbox feature in Solr 4.x with SolrCloud. Do you have any plans to convert this plugin to a Solr 4 compatible one?
          Thanks,
          arin

          David Smiley added a comment -

          Frederick, a rough inspection of your problem suggests that the GeoHashField is declared multiValued="true" but the field in your POJO is not correspondingly a List<String> like it should be. If you only need a single value, then I suggest you use LatLonType instead, since it's what comes with Solr.

          Frederick N. Brier added a comment -

          I am new to Solr/Solandra so it is not clear to me how to declare a POJO with the @Field annotation for a GeoHashField. If I declare the property as a String, it seems to parse it and store it (the below value is correct). But the query response fails when it attempts to marshal the GeoHashField data back into a String:

          Exception while setting value : [32.76932462118566,-79.92890948429704] on java.lang.String mybuilding.latLong
          IllegalArgumentException: Can not set java.lang.String field mybuilding.latLong to java.util.ArrayList

          Perhaps the property should be declared as a GeoHashField or a LatLonType to be properly marshaled and unmarshaled, but with my unfamiliarity with Solr, I do not know how to store and retrieve the value from those types in my getter/setter. Thank you for any explanation on how to declare the POJO.

          David Smiley added a comment -

          Thanks Mikhail; I've uploaded a new version with these changes. I tweaked the formatting and another trivial thing or two.

          I don't see the point in explicitly configuring the cache named "fieldValueCache", since Solr will create it for you automatically with reasonable defaults. But I kept your tip in the README anyway.

          Mikhail Khludnev added a comment -

          Hi,
          Solr2155-for-1.0.2-3.x-port.patch has some small amendments for the backport:

          1. exception text for the absent sfield local param;
          2. a cache-enabling recommendation added to README.txt (the cache name is a little confusing);
          3. a fix for UnsupportedOperationException on debugQuery=on for the geodist func (but my toString() impl seems overcomplicated).

          David,
          Please let me know if I can apply this into any codebase.

          Thanks for backport!

          David Smiley added a comment -

          I ported SOLR-2155 to Solr 3.x and did so in a manner that plugs into an unpatched Solr. Any source that the patch modified was copied and moved into another package so I could keep this capability independent. The attached zip, Solr2155-1.0.2-project.zip, is a Maven-based project, including .git/ for history. You'll need to run "mvn package" to generate a jar file that you can throw into your classpath. There is a skimpy README.txt that tells you what to do to your schema & solrconfig files. With this in place, you have a multi-value geospatial filter & sort for indexed points. And if you use my query parser, then you get explicit bounding-box query filter capability.

          David Smiley added a comment -

          Your use-case is a feature I have intended to have LSP address in a direct manner when I have time. In the meantime, there are a couple of approaches that should work.

          The first approach that comes to mind is to use the LSP QuadPrefixTree with LSP's ability to index rectangles. You would treat the x dimension as time, and ignore the y dimension (use 0). What helps make this possible is LSP's unique ability to index shapes other than points, and in an efficient manner. The only spatial filter query operation that LSP supports right now is an intersection. If your query is simply a point (a specific time) then this is fine, or if it is a time duration and you want all stores that were open for at least part of this time, then it's fine. If your query is a time duration and you want it to reside completely within an indexed time duration, then no-can-do for now. Based on the nature of your use-case, it may suffice to use multiple spatial filter queries, each one a point (time) at each hour interval of the desired query duration.

          The second approach is similar to your suggestion but with y = closing time, not the delta; y should always be > x. I just did some sample Venn diagrams to verify this approach. If you want to find documents with an indexed duration that completely overlaps your query time, then you do a bounding box filter query spanning x = 0..starttime and y = endtime..max (where max is the maximum indexable time). When you initialize the LSP QuadPrefixTree you need to tell it the range of values. Some time ago when writing tests, I discovered it simply can't handle Double.MAX_VALUE, but I imagine it will handle your 30,000. If you want to use this patch (SOLR-2155) and not LSP then you will instead have to map your times to latitude-longitude ranges and use a geohash grid length with granularity sufficient to differentiate your smallest unit of time (5 min).
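This second approach reduces "indexed duration completely covers the query span" to a box-containment test. A hedged sketch, with MAX_T standing in for the grid's configured maximum (an assumed constant, not an LSP name):

```python
MAX_T = 30_000  # assumed maximum indexable time unit

def covers_query(p_open, p_close, q_start, q_end):
    """A doc indexed at point (x=open, y=close) falls inside the
    query box x in [0, q_start], y in [q_end, MAX_T] exactly when
    p_open <= q_start and p_close >= q_end, i.e. the indexed
    duration fully covers the queried span."""
    in_x = 0 <= p_open <= q_start
    in_y = q_end <= p_close <= MAX_T
    return in_x and in_y

covers_query(8, 20, 10, 13)   # open 8-20 covers query 10-13 -> True
covers_query(11, 20, 10, 13)  # opens too late -> False
```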

          I think the 2nd approach is simplest and ideal based on what you've said about your needs.

          If you want help with LSP then email me directly: david.w.smiley@gmail.com

          geert-jan brits added a comment -

          David,

          I try not to swamp this discussion, but I have a totally different issue for which I might misuse this patch / LSP.

          It's about POIs having multiple opening hours (depending on day of week, special festivity days, and sometimes even multiple time slots per day).
          I want to query, for example, all pois that are open NOW, and that will remain open until NOW+3H.

          For background see: http://lucene.472066.n3.nabble.com/multiple-dateranges-timeslots-per-doc-modeling-openinghours-td3368790.html on why all normal approaches don't work (afaik): basically it's about needing multiple opening/closing times and having them be pairwise related.

          I have the feeling that opening/closing datetimes might be modelled as multiple lat/long points. But I would need a query of the form:

          Given a user defined point x, return all docs that have a point p defined for which:

          • x.latitude > p.latitude
          • x.longitude < p.longitude

          Is this possible? (As far as I see GeoFilt, BBox, GeoDist don't provide me with what I need)

          Basically this is how I envision encoding it:

          • each <open, closedelta>-tuple is represented as a (lat/long) point
          • open is matched on latitude
          • closedelta (closedelta is represented as delta from open) is matched on longitude
          • granularity is 5 minutes
          • open can be a max of 100 days in the future -> ~30,000 distinct values.
          • closedelta can be at most 24 hours -> ~300 distinct values

          The above lat/long query applied to the domain would become:
          Given a user-defined open/closedelta-datetime x, return all docs that have an open/close-datetime p defined for which:

          • x.open > p.open (poi is already open at requested opening time)
          • x.closedelta < p.closedelta (poi is not yet closed on the requested closing time)

          In other words, the poi is open from the requested open-datetime until at least the requested close-datetime.
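The encoding above amounts to a per-point predicate over a multi-value field. A minimal sketch of the predicate exactly as stated (illustrative only; in Solr this would have to be expressed through a spatial filter, and the y = absolute-closing-time variant discussed elsewhere in this thread avoids edge cases of the delta form):

```python
def matches(points, x_open, x_closedelta):
    """points: list of (open, closedelta) tuples indexed for one POI,
    quantized to 5-minute slots. Applies the stated predicate:
    x.open > p.open and x.closedelta < p.closedelta."""
    return any(x_open > p_open and x_closedelta < p_delta
               for p_open, p_delta in points)

# POI opens at slot 0 and stays open for 12 slots (one hour at 5 min/slot)
poi = [(0, 12)]
matches(poi, 2, 6)   # already open at slot 2, stays open long enough -> True
matches(poi, 0, 6)   # x.open > p.open fails at slot 0 -> False
```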

          Ok, good exercise in writing this down, the question remains is this query possible (perhaps with some coding-efforts)?

          Thanks,
          Geert-Jan

          geert-jan brits added a comment - - edited

          Great, thanks. I believe your interpretation of my use-case is correct.
          I will go the Multi-point route first, without the polygons.

          Just to clarify: I realize I added to the confusion by bringing polygons to the table where they aren't necessary for the problem I described.
          I did this because I thought that perhaps 'distance of point to polygon' was implemented in LSP, while 'distance of point to collection of points' was not.

          In that case 'transforming the problem space' by representing a 'collection of points' as a polygon and querying for "distance of point to polygon" instead would have given me what I wanted. This is all superfluous now, because doing 'distance of point to collection of points' IS possible.

          I will check out the code, thanks again!

          David Smiley added a comment -

          You mention: "Sorting by (multi-value) indexed shapes is supported only for points". Does this mean that representation 1.) above is supported? It wasn't entirely clear for me from your response.

          Yes, it does. This patch & LSP support sorting documents by a multi-value point field.

          Based on the description of your use-case and my interpretation of what you say (which I am not 100% sure of), I don't think it's pertinent that the "walks" are polygons. What is pertinent is that they are a collection of points (Pois). So if your "walk" corresponds to a Solr document, you could have a "poi" field that is an indexed multi-value geospatial point field. Then you could sort walks (documents) according to the pois that are closest to a user-specified point in the query. I would add a large bounding-box filter which would improve performance. It's not clear to me there is any need for polygon support.
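The sort described here (order walks by the distance of their nearest POI to the user's point) can be sketched as follows. This is an illustration of the idea, not the patch's code; `dist` is a simple Euclidean placeholder where real code would use great-circle distance:

```python
import math

def dist(a, b):
    # placeholder metric; geodist() actually uses great-circle distance
    return math.hypot(a[0] - b[0], a[1] - b[1])

def sort_walks_by_nearest_poi(walks, user_point):
    """walks: {walk_id: [(lat, lon), ...]} -- a multi-value point field.
    The sort key per walk is the minimum distance over its POIs."""
    return sorted(walks,
                  key=lambda w: min(dist(p, user_point) for p in walks[w]))

walks = {"A": [(0, 0), (5, 5)], "B": [(2, 2)]}
sort_walks_by_nearest_poi(walks, (0, 1))  # -> ["A", "B"]
```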

          FYI I use the well-known JTS library for polygon support.

          geert-jan brits added a comment - - edited

          David, to clarify:
          My use-case could be either represented as:
          1. a bag of points, in which case I want to be able to return the closest point to a user-defined point and sort on the distance
          2. a polygon made of the points (where the points are the vertices of the polygon) and return the closest distance from a user-defined point to the polygon.

          Either of the solutions suffices for me, from your answer I can't entirely see if that was clear.

          You mention: "Sorting by (multi-value) indexed shapes is supported only for points".
          Does this mean that representation 1.) above is supported? It wasn't entirely clear for me from your response.

          Let me give you the use-case, (and why the sort on center-point / centroid is not going to work):

          Consider a travel application in which walks/itineraries can be defined. Most of the walks are defined as roundtrips (i.e: beginpoint = endpoint). In my representation (for now) a walk visits certain Points of interest (poi) (which each have a lat/long point defined) in a certain order.

          A lot of walks can be started at any given POI (because of the roundtrip nature).
          I want a user to be able to request walks that are nearby (sorted by distance). For each walk, the distance becomes that of the closest POI (thus point) defined in the walk relative to the user-defined point.

          Does this make sense?

          P.S.: having only thought of representing this problem as polygons to support the 'find closest point' query, I skimmed over the fact that for my notion of a walk (ordered collection of points), connecting the points in the order specified may generate a complex (self-intersecting) polygon. Are these polygons supported in LSP?

          David Smiley added a comment -

          geert-jan, your impression of the capabilities of the code is correct including your use-case for the most part. Your use-case describes that the document might have an indexed polygon – that is supported in LSP but not this patch. Sorting by (multi-value) indexed shapes is supported only for points, not yet other shapes like polygons. But you could basically get this by indexing an additional multi-point field containing the polygon vertices and center-point, and then sorting on that. If you have a particular sorting use-case with nuances that my solution does not cover then please let me know.

          geert-jan brits added a comment -

          I have the impression that this code is meant for drawing shapes and checking whether geospatially enriched documents fall within those shapes. Is that correct?

          Perhaps my use-case is also supported, because it's in the 'multi-geopoint domain' as well.

          I envision documents having multiple lat/long points. I would like to query (sort / filter on) documents by their 'closest point' to a given user-defined lat/long point. Documents would either contain a bag of lat/long pairs or a polygon made up out of these lat/long pairs and the query would become: return the closest distance from a user-defined point to the polygon.

          Before delving in the above code or in the LSP-code myself, perhaps someone can say if this type of querying is supported?

          David Smiley added a comment -

          Bill, have you checked out LSP?: http://code.google.com/p/lucene-spatial-playground/ A couple folks have been kicking the tires last week. We've got some benchmarking and (more) testing to do, and there's still the polygon date-line & pole wrapping feature-gap that hasn't been implemented yet. There's always more I want to do with it but it's in decent shape now. It can even index shapes with area (i.e. not just points but other shapes) – the only Lucene/Solr native implementation that I know of, Ryan too.

          William Bell added a comment -

          OK... Would it be asking too much to commit this? It is very stable and works very well.

          I know that there might be a new version coming out, but isn't that always the case?

          I would love more people to weigh in... Vote?

          Grant Ingersoll added a comment -

          I don't think we are abandoning it, I think David and Ryan, etc. have decided to bake things in the playground for a bit and then may move it back. Lance, the code is in Google Code, search the archives.

          Bill Bell added a comment -

          Why are we abandoning this? I thought it was a good enhancement. I need this feature to be committed so that I can do multiple points per row.

          We can mark it experimental?

          Lance Norskog added a comment -

          Where is "lucene-spatial-playground"?

          Grant Ingersoll added a comment -

          If the intent is to bring in the "lucene-spatial-playground" into the ASF, why not just start a branch? It will make provenance so much easier.

          Lance Norskog added a comment -

          Excellent! Geo is a complex topic, too big for a one-man project.

          Lance

          David Smiley added a comment -

          To anyone listening: I'll continue to support my latest patch here with any bug fixes or basic things. As of today I'll principally be working directly with Ryan McKinley on his "lucene-spatial-playground" code-base. He ported my patch to this framework as the predominant means of searching for points (single or multi-value) and I'm going to finish what he started. This new framework is superior to the geospatial mess in Lucene/Solr right now (no offense to any involved). It won't be long before it's ready for broad use as a replacement for anything existing. I look forward to exploring new indexing techniques with this framework, and for it to eventually become part of Lucene/Solr.

          David Smiley added a comment -

          Attached is a new patch. The highlights are:

          • Requires the latest Solr trunk – probably anything in the last few months: If this is ultimately going to get committed then this needed to happen. There are only some slight differences so if you really need an earlier trunk then I'm sure you'll figure it out.
          • Adds support for sorting, including multi-value: Use the existing geodist() function query with a lat-lon constant and a reference to your geohash based field. Note that this works by loading all points from the field into memory, resolving each underlying full-length geohash into the lat & lon into a data structure which is a List<Point2D>[]. This is improved over Bill's patch, surely, but it could use some optimization. It's not optimized for the single-value case either; that's a definite TODO.
          • Polygon/WKT features have been omitted due to LGPL licensing concerns around JTS. I've left hooks to make it easy to add back this capability that already existed. You'll easily figure it out if you are so inclined. I might add this as a patch shortly (not to be committed) when I get some time; but longer term it will re-surface under a separate project. Don't worry; it'll be painless to use if you need it.
          • This might be controversial, but as part of this patch I removed the ghhsin() and geohash() function queries. Their presence was confusing; I simply don't see what point there is to them now that this patch fleshes out the geohash capability.
          • I decided to pre-register my "SpatialGeoHashFilterQParser" as "geohashfilt", instead of requiring you to do so in solrconfig.xml. You could use "geofilt" for point-radius queries but I prefer this one since I can specify the bbox explicitly.
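The sorting work described above hinges on resolving each indexed full-length geohash back into lat & lon. A sketch of standard geohash decoding as bit-interleaved binary search (illustrative, not the patch's GeoHashUtils code):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet

def decode_geohash(gh):
    """Decode a geohash string to the (lat, lon) at its cell center.
    Each char carries 5 bits; bits alternate longitude/latitude,
    starting with longitude, each halving the current range."""
    lat = [-90.0, 90.0]
    lon = [-180.0, 180.0]
    is_lon = True
    for ch in gh:
        val = BASE32.index(ch)
        for shift in range(4, -1, -1):
            rng = lon if is_lon else lat
            mid = (rng[0] + rng[1]) / 2
            if (val >> shift) & 1:
                rng[0] = mid   # bit set: keep the upper half
            else:
                rng[1] = mid   # bit clear: keep the lower half
            is_lon = not is_lon
    return ((lat[0] + lat[1]) / 2, (lon[0] + lon[1]) / 2)

decode_geohash("ezs42")  # roughly (42.605, -5.603)
```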

          There are a few slight changes to GeoHashPrefixFilter that crept in from unfinished work (notably tying sorting to filtering in an efficient way), but they are harmless.

          Bill, thanks for kick-starting the multi-value sorting. I re-used most of your code.

          Ryan McKinley added a comment -

          Congratulations on the new baby!

          Thinking about spatial support in general, I think we should settle on some basic APIs and approaches that can be used across many indexing strategies. In http://code.google.com/p/lucene-spatial-playground/ I'm messing with how we can use a standard API to index Shapes with various strategies. As always, each strategy has its tradeoffs, but if we can keep the high-level APIs similar, that makes choosing the right approach easier. In this project I'm looking at indexing shapes as:

          • bounding box – 4 fields xmin/xmax/ymin/ymax
          • prefix grids – like geohash or csquares
          • in memory spatial index (rtree/quadtree)
          • raw WKB geometry tokens
          • points – x,y fields
          • etc

          To keep things coherent, I'm proposing a high level interface like:
          https://lucene-spatial-playground.googlecode.com/svn/trunk/spatial-lucene/src/main/java/org/apache/lucene/spatial/search/SpatialQueryBuilder.java

          And then each implementation fills it in:
          https://lucene-spatial-playground.googlecode.com/svn/trunk/spatial-lucene/src/main/java/org/apache/lucene/spatial/search/prefix/PrefixGridQueryBuilder.java

          Solr then just handles setup and configuration:
          http://lucene-spatial-playground.googlecode.com/svn/trunk/spatial-solr/src/main/java/org/apache/solr/spatial/prefix/SpatialPrefixGridFieldType.java

          In my view geohash is a subset of 'spatial prefix grid' (is there a real name for this?) – the interface I'm proposing is:
          http://lucene-spatial-playground.googlecode.com/svn/trunk/spatial-base/src/main/java/org/apache/lucene/spatial/base/prefix/SpatialPrefixGrid.java
          essentially:

            public List<CharSequence> readCells( Shape geo );
          

          Geohash for a point would just be a list of one token – for a polygon, it would be a collection of tokens that fill the space like csquares
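That space-filling idea can be shown with a toy version of `readCells` for a quad grid (not the proposed SpatialPrefixGrid API; cell tokens here are just quadrant letters): recursively subdivide the world, emit a cell's token when the query fully covers it or the depth limit is hit, and recurse otherwise.

```python
def read_cells(query, depth, world=(0.0, 0.0, 1.0, 1.0), prefix=""):
    """Return prefix-grid tokens covering `query`; boxes are
    (xmin, ymin, xmax, ymax) in a unit world."""
    qx0, qy0, qx1, qy1 = query
    x0, y0, x1, y1 = world
    if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
        return []                      # disjoint: contributes nothing
    covered = qx0 <= x0 and qy0 <= y0 and qx1 >= x1 and qy1 >= y1
    if covered or depth == 0:
        return [prefix]                # emit this cell's token
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quads = {"a": (x0, y0, xm, ym), "b": (xm, y0, x1, ym),
             "c": (x0, ym, xm, y1), "d": (xm, ym, x1, y1)}
    out = []
    for label, box in quads.items():
        out.extend(read_cells(query, depth - 1, box, prefix + label))
    return out

read_cells((0.0, 0.0, 0.5, 0.5), 2)  # -> ["a"] (exactly quadrant a)
```

A point query yields a single deepest-level token, while a larger shape yields a list of tokens that tile it, matching the description above.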

          I aim to get this basic structure in a lucene branch and maybe into trunk in the next few weeks....

          David Smiley added a comment -

          I plan to finish a couple improvements to this patch within 2 weeks time: distance function queries to work with multi-value, and polygon queries that span the date line. I've been delayed by some life events (new baby). Furthermore, I'll try and ensure that the work here is applicable to pure Lucene users (i.e. sans Solr).

          One thing I'm unsure of is how to integrate (or not integrate) existing Lucene & Solr spatial code with this patch. In this patch I chose to re-use some basic shape classes in Lucene's spatial contrib simply because they were already there, but I could just as easily have not. My preference going forward would be to outright replace Lucene's spatial contrib with this patch. I also think LatLonType and PointType could become deprecated, since this patch is not only more capable (multiValue support) but faster too (for filtering, at least; sorting is TBD). I'm also inclined to name the field type LatLonGeohashType to reinforce the fact that it works with lat & lon; geohash is an implementation detail. In the future it might not even be a geohash, strictly speaking, once we optimize the encoding.

          Chris Male added a comment -

          In LUCENE-2599 I deprecated the spatial contrib. The problem is as Robert raises, deprecating the code without providing an alternative isn't that user friendly. I think as part of this issue we should start up the spatial module and work towards moving what we can there. Moving function queries is going to take some time since they are very coupled to Solr. But that shouldn't preclude us from putting into the module what we can. Once we have a module that provides a reasonable set of functionality, then we can deprecate/gut/remove the spatial contrib.

          Robert Muir added a comment -

          well what would the deprecation have suggested as an alternative?

          Grant Ingersoll added a comment -

          Yeah, I agree. I haven't looked at the patch yet. It was my understanding that Chris Male was going to move lucene/contrib/spatial to modules and gut the broken stuff in it. I think there is a separate issue open for that one. Presumably, once spatial and function queries are moved to modules, then we will have a properly working spatial package.

          I obviously can move it, but I don't have time to do the gutting (we really should have deprecated the tier stuff for this release).

          Robert Muir added a comment -

          I don't really think things like this (queries etc) should go into just Solr, while we leave the lucene-contrib spatial package broken.

          Lets put things in the right places?

          Bill Bell added a comment -

          Grant!! Game plan to get this committed?

          Bill Bell added a comment - - edited

          Lance,

          Thanks. But in order to use PointType I need the ability to append another parameter to the suffix for the lat,long pair.

          suffixes = new String[dimension];
          for (int i = 0; i < dimension; i++) {
            suffixes[i] = "_" + i + suffix;
          }


          This would add "geohash_1_<suffix>" and "geohash_2_<suffix>" for a 2 dimensional field (Lat,Long). If I add 2 values in the same name, it will just overwrite the field (as is the case now)...

          I personally don't like the way this was done. It focuses on the dimension, where I need to focus on the number of multiValue pairs. I guess we could do something like the following as we build the array:

          dimension=2... That is static for lat,long and can be in the schema.xml. I need to add another number to pair these.

          suffixes[] = multivalue_index + "_" + i + suffix

          43.5614,-90.67341|30.44614,-91.60341|35.0752,-97.202 would instead be:

          geohash_0_0_suffix = 43.5614
          geohash_0_1_suffix = -90.67341
          geohash_1_0_suffix = 30.44614
          geohash_1_1_suffix = -91.60341
          geohash_2_0_suffix = 35.0752
          geohash_2_1_suffix = -97.202

          Does it also make sense to have:
          geohash_num_suffix = 2 (the number of multivalue pairs in this document).

          I toyed with having a maxmultivalue=10, but thought that would be pretty inefficient. This should not be static, since the number of pairs of lat,long - could go from 1 to 25 on each document.
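
          As a rough illustration of the naming scheme proposed above, field names could be generated like this (a hypothetical helper, not existing PointType code; the "geohash" prefix follows the examples above):

          ```java
          // Hypothetical sketch of multi-value point field naming:
          // "geohash_<pointIndex>_<dimension>_<suffix>", as proposed above.
          public class MultiPointNames {
              public static String[] fieldNames(int numPoints, int dimension, String suffix) {
                  String[] names = new String[numPoints * dimension];
                  for (int p = 0; p < numPoints; p++) {
                      for (int d = 0; d < dimension; d++) {
                          // one field per (point, coordinate) pair
                          names[p * dimension + d] = "geohash_" + p + "_" + d + "_" + suffix;
                      }
                  }
                  return names;
              }
          }
          ```

          For three lat,long pairs this yields geohash_0_0_suffix through geohash_2_1_suffix, matching the example list above.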

          I could easily take PointType.java and create a new MultiPointType.java and add these to the createFields() for the document.

          I might be missing things that also need to be done to support something like this. It might make sense to just extend PointType.java to work with multiValued types. I don't want to break anything else.

          Bill

          Lance Norskog added a comment -

          It would be better if you copied each lat long into the index with a prefix added to the sfield.

          This is what SOLR-1311 does. Here's the code for PointType: PointType.java

          BTW, Bill: thanks for all the work on this. Geo is hard stuff and having real users helps.

          Bill Bell added a comment -

          Test cases for geomultidist() function.

          Add this and SOLR.2155.p3.patch

          Bill Bell added a comment - - edited

          This is the patch with some speed improvements.

          Example call:

          
          http://localhost:8983/solr/select?q=*:*&fq={!geofilt}&sfieldmulti=storemv&pt=43.17614,-90.57341&d=100&sfield=store&sort=geomultidist%28%29%20asc&sfieldmultidir=asc
          
          

          This addresses/fixes:

          3. Use DistanceUtils for hsin
          4. Remove split() to improve performance

          Bill Bell added a comment -

          I did more research. You cannot get from doc to multiple values in the cache for a field; it does not exist from what I can see. The "docToTermOrd" property (type Direct8) is an array that is indexed by the document ID and has one value (the term ord). It does not appear to be easy to get a list since there is one value. This was created to easily count the number of documents for facets (does it have 1 or more). I could do something like the following (but it would be really slow).

          Document doc = searcher.doc(id, fields);

          It would be better if you copied each lat/long into the index with a prefix added to the sfield, like "store_1", "store_2", "store_3", when you index the values. Then I can grab them easily. Of course you could also just store them in one field like I did, but name it store_1 : "lat,lon|lat,lon". If we did this during indexing it would make it easier for people to use (not having to copy it) with bars. Asking for 2, 3, 4 term lists by document ID is probably slower than just doing the "|" separation.

          I keep going back to my patch, and I think it is still pretty good. I hope others have not gone down this same path, since it was not fun.

          Improvements potential:

          1. Auto populate sfieldmulti when indexing geohash field into "|"
          2. Multi-thread the brute force looking for lat longs
          3. Use DistanceUtils for hsin
          4. Remove split() to improve performance

          Bill
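
          Improvement 4 above (removing split()) might look like this sketch, walking the bar-delimited "lat,lon|lat,lon" value with indexOf instead of a regex split; class and method names are hypothetical:

          ```java
          // Sketch: parse a bar-delimited "lat,lon|lat,lon" field value without
          // String.split(), avoiding the regex cost mentioned above.
          import java.util.ArrayList;
          import java.util.List;

          public class BarParser {
              public static List<double[]> parsePoints(String value) {
                  List<double[]> points = new ArrayList<>();
                  int start = 0;
                  while (start < value.length()) {
                      int bar = value.indexOf('|', start);
                      if (bar < 0) bar = value.length(); // last pair has no trailing bar
                      int comma = value.indexOf(',', start);
                      double lat = Double.parseDouble(value.substring(start, comma));
                      double lon = Double.parseDouble(value.substring(comma + 1, bar));
                      points.add(new double[] { lat, lon });
                      start = bar + 1;
                  }
                  return points;
              }
          }
          ```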

          David Smiley added a comment -

          There is no committer assigned, as you can see. After sorting (and, to a lesser extent, polygon support), I suspect it'll do enough to attract committer interest.

          1. Sorting without a geo filter does present a challenge; this is something I've been thinking about. However, haversine is only evaluated for each matching result. If there aren't many, then it isn't too bad. If there are many, then the only thing I can think of would be to try and only get the distance for points in a geohash box filter at the query center, assuming you're only looking at the top-10 results. If there aren't enough results in the box to fill the top-10, then you could either recursively expand the geohash box or give up on being smart and traverse the remaining matched documents. Figuring out how to guess a suitable initial box size might be tricky.
          2. My latest geohash field indexes at every intermediate resolution. So if you were looking through the index values looking for the actual full-detail points, you'd need to filter out those that aren't long enough.
          3. You're only getting one value out of the field cache because each term/geohash (i.e. point) is a separate value. I confess to not having coded with the field cache and value sources yet. It has first class support for single-value per document but multi-value was added later and I don't yet know what's involved.
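
          For reference, the per-result haversine computation discussed in point 1 can be sketched as follows. This is a generic great-circle distance, not Solr's DistanceUtils API; the radius constant and names are illustrative:

          ```java
          // Generic haversine great-circle distance sketch (km). Not the
          // DistanceUtils signature; shown only to ground the discussion above.
          public class Geo {
              static final double EARTH_RADIUS_KM = 6371.0; // mean earth radius

              public static double haversineKm(double lat1, double lon1,
                                               double lat2, double lon2) {
                  double dLat = Math.toRadians(lat2 - lat1);
                  double dLon = Math.toRadians(lon2 - lon1);
                  double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                           + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                           * Math.sin(dLon / 2) * Math.sin(dLon / 2);
                  return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
              }
          }
          ```

          One degree of longitude at the equator comes out to roughly 111 km, which is a quick sanity check for any implementation.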

          Bill Bell added a comment -

          OK, I looked into getting the geohash and converting it to LatLon... I still can only get one; not sure how to get more than one. The other issue is I get "9qv" and not the full geohash. It would be nice to know how to do this.

          ValueSource vs = sf.getType().getValueSource(sf, fp);
          DocValues llVals = vs.getValues(context, reader);
          org.apache.lucene.spatial.geohash.GeoHashUtils.decode(llVals.strVal(doc));

          The strVal(doc) only returns one value, and the value is not fully qualified - it seems it is a tokenized (Ngram) version.

          Thanks.

          Bill Bell added a comment -

          David,

          Who is the committer assigned to this ticket?

          I agree that I need to find a way to get at the multiValue fields for LatLon and GeoHash. I will start working on trying to get that to work, probably first with GeoHash since you already deal with multiValue (gazetteer). LatLon does not handle multi values yet. Based on your patch it seems like I can.

          I like the parallel idea too.

          I finally see your point. The use case is when you do a geodist() without a {!geofilt}. This would be very slow since you would be looping through a large result set. However, if you were to throw away far points, you would need to know the "d=" parameter (making it mandatory); the current implementation does not require "radius". I also think your geohash combined with geomultidist() is awesome. We just need to try to make it faster.

          I will continue working on this - since my client needs it now. Just tell me what I need to do, what you recommend, use cases, etc. If we can get a committer involved, it might get included in 4.0.

          Bill

          David Smiley added a comment -

          Bill,
          It would be nice if the sorting didn't require a separate field than the geohash field since the geohash field already has the data required. That was the main point of my criticism RE using a character to separate the values. I know how to modify your code accordingly but that's not really the interesting part of our conversation.

          I am aware of how geodist() works and that your algorithm is conceptually very similar. But just because geodist() works this way and was written by Solr committers doesn't make it fast. It loads every field value into RAM via Lucene's field cache and then does a brute-force scan across all values to see if each is within the shape (a haversine-based circle). Then, yes, it only sorts on the remainder. Pretty simple. More evidence that this is suboptimal is the trend toward parallelizing the brute-force scan across multiple threads (AFAIK JTeam does this, and I believe geodist() is planned to, though I forget where I saw that). The brute-force aspect of it is what I find most uninspiring; the RAM might not be so much a problem, but still.

          I know you can't use geohash for sort (except for approximation), but it can help filter the data set so that you don't compute haversine for points in geohash boxes that you know aren't within the queried box. The fewer points there are in the queried box relative to the entire globe of points, the better the performance. That's the central idea I'm presenting. And I'm not talking about precision loss. I have a month of other stuff to get to, then I can get to this, including benchmarks.
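
          That filtering idea (skip the haversine for points whose geohash falls outside the cells covering the query) might be sketched like this; the prefixes and names here are made up for illustration and are not the GeoHashPrefixFilter code:

          ```java
          // Illustrative sketch: a point can only be inside the queried area if its
          // geohash starts with one of the grid-cell prefixes covering the query.
          // Everything else is rejected before any haversine is computed.
          public class PrefixFilter {
              public static boolean mayMatch(String pointGeohash, String[] coveringPrefixes) {
                  for (String prefix : coveringPrefixes) {
                      if (pointGeohash.startsWith(prefix)) {
                          return true; // candidate: verify with an exact distance check
                      }
                  }
                  return false; // definitely outside every covering cell
              }
          }
          ```

          Points that pass the prefix check still need the exact distance computed, since a covering cell only partially overlaps the query shape.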

          Bill Bell added a comment -

          David,

          This seems to be pretty fast since the results are constrained by d=<km> first, and then finding the closest points by distance from pt. It is at least as fast as geodist(). geodist() uses the same algorithm, and if you were to duplicate the lat,long in separate rows, you would be searching on the same number of fields. The one area we could improve performance would be in the split() regex call. We could put them into separate fields to speed that up, but I am not an expert on the API to get dynamic fields. For example: <dynamicField name="storemv_*" type="string" indexed="true" stored="true"/>. My question is: what is the API call to get the fields stored for a document beginning with "storemv_"? If we do that, we can use a copy field for lat,long values.

          I copied the Haversine function that Grant added in ./java/org/apache/solr/search/function/distance/HaversineConstFunction.java, since I felt geodist() and geomultidist() could use the same distance calculation since it is named the same. But you are right we should just convert both programs to use the DistanceUtils class.

          I cannot see how we can get accurate distances using boxes (but you know more about geohash than I do); it would only be an approximation. The boxes work great for filtering. Then we need something to calculate the distance from pt to the value in the index. If you want to approximate the distance then boxes would work, but you kinda have that with the filter, right? The use case that I am trying to solve is: millions of locations, but the user only selects d=10, 20, 50, or 100, and these results are smaller than the overall population of points. Then sort by distance.

          There is a use case that says show me the top 100 closest documents, and I don't care about the exact order. You solved that already with the filter.

          I would vote for making geomultidist() work faster, but I need accurate distances. This code is pretty good; we can create a few test cases and submit it to be included since it works with LatLon and geohash... For LatLon this is pretty much the best it gets.

          Bill

          David Smiley added a comment -

          Nice, Bill. Why are you asking for the field to be character delimited instead of asking for separate values (which translates to separate indexed terms)? And I noticed your patch included haversine code; were you unaware of the same code in a utility function in a DistanceUtils class (from memory)?

          Any way... I was thinking of this problem last night. The main challenge with distance sorting I see is scalability, not coming up with something that merely works. If the use-case is wanting to see the top X results out of potentially a million, then I think a fast solution would be code that only calculates that top X, and that leverages the geospatial index (geohashes). It could start with the boxes covering the filter area and then it could keep contracting the grid coverage area to the point that any further contraction wouldn't meet the desired top-X threshold. To do this efficiently, it needs a single filter bitset of all doc ids that are actually in the search results, and it needs to know the center of the user query, and the bounding box of the user query for its starting point. This might be pretty fast, but it wouldn't be very cacheable if further search refinements occur while keeping the same geospatial filter. So the code would be simpler if my filter here recognized that a sort is ultimately required which would cause it to go through every point (down to the full precision) and put the doc ids in a sorted list. That's probably the best approach, on balance.
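
          The "top X out of potentially a million" idea above amounts to a bounded nearest-neighbor collector. A generic sketch using a bounded max-heap (names are illustrative, not from this patch):

          ```java
          // Sketch: keep only the X nearest documents seen so far in a max-heap
          // keyed on distance, so the worst of the current top-X is on top and
          // can be evicted cheaply. Avoids sorting every matching document.
          import java.util.PriorityQueue;

          public class TopXCollector {
              private final int x;
              private final PriorityQueue<double[]> heap; // entries: {distance, docId}

              public TopXCollector(int x) {
                  this.x = x;
                  this.heap = new PriorityQueue<>((a, b) -> Double.compare(b[0], a[0]));
              }

              public void collect(int docId, double distance) {
                  if (heap.size() < x) {
                      heap.add(new double[] { distance, docId });
                  } else if (distance < heap.peek()[0]) {
                      heap.poll(); // evict the current worst
                      heap.add(new double[] { distance, docId });
                  }
              }

              // current cutoff: any point farther than this cannot enter the top X
              public double worstDistance() {
                  return heap.isEmpty() ? Double.POSITIVE_INFINITY : heap.peek()[0];
              }
          }
          ```

          The worstDistance() cutoff is what would let the grid-contraction idea above stop early: once every remaining grid cell is provably farther than the cutoff, collection can stop.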

          Bill Bell added a comment -

          New file. This works with geohash and normal LatLon. Here is an example with LatLon. The new field added is storemv. It is bar delimited. New fields:

          sfieldmultidir - asc or desc
          sfieldmulti - name of the field

          Can use for sorting or scoring. It will check all points in sfieldmulti field and find closest or farthest points.

          
          http://localhost:8983/solr/select?rows=1000&q=_val_:%22geomultidist%28%29%22&fl=storemv,score,store&fq={!geofilt}&sfieldmultidir=asc&sfieldmulti=storemv&pt=45.17614,-93.87341&d=10000&sfield=store&sort=geomultidist%28%29%20asc
          
          
          Bill Bell added a comment -

          Patch to add function geomultidist().

          This is to share with Grant and David. It is not finished. Add a string field to your schema.xml and name it whatever you want. This field will have a bar delimited list of lat/longs.

          <arr name="storemv">
          <str>43.17614,-90.57341|43.17614,-91.57341</str>
          </arr>

          Add a parameter called sfieldmulti to point to it.

          
          http://localhost:8983/solr/select?q=*:*&fq={!geofilt}&sfield=store&pt=43.17614,-90.57341&d=10&sfieldmulti=storemv&sort=geomultidist%28%29%20asc
          

          Questions:

          • createWeight() purpose?
          • equals() purpose - how called?
• hashCode() purpose?
          • toString() and description() ?
          David Smiley added a comment -

          So Bill's talking about sorting, and Lance is talking about polygons.

          Sorting: I'll try and get to it next; but this patch is low-priority for me at the moment.

Polygons: Lance, I figured I could already shift the coordinates off of the dateline, but what was mentally hurting was contemplating a snake-like polygon that encircles the globe. And a polygon for, say, Antarctica (a polygon covering a pole). The SLERP stuff is interesting but I don't know how to apply it. For someone who claims not to be a math guy, you're doing a good job fooling me.

          Lance Norskog added a comment -

          The lat/long version has to be rotated away from the "true". Then, the calculations don't blow up at the poles or the equator.

          The real answer to doing geo and have it always work is to use quaternions. A lat/lon pair is essentially a complex number: latitude is the scalar and longitude rotates back to 0. A quaternion is a 4-valued variation of complex numbers: "a + bi + cj + dk" where i,j,k are separate values of sqrt(-1), assuming an infinite number of such values. A geo position, projected onto quaternions, gives a subspace.

          There are a bunch of 3D algorithms which use quaternions because they don't have problems at the (0->1) boundary. The classic apocryphal story is the jet fighter pilot on a test flight: he crossed the equator and the plane flipped upside down. Quaternions don't have this problem.

          SLERP explains the problem of distance on a sphere. How to do distances, box containment, etc. I don't know. I am so not a math guy.
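A related, simpler way to sidestep pole and dateline singularities (not the full quaternion approach Lance describes, but the same underlying idea of leaving lat/lon space) is to represent each point as a 3D unit vector and get the great-circle angle from atan2 of the cross- and dot-products, which is numerically stable everywhere on the sphere. A minimal sketch; the class name is made up for illustration:

```java
// Lat/lon points as 3D unit vectors: the great-circle angle between two
// points is atan2(|a x b|, a . b), which does not blow up at the poles
// or across the dateline the way naive lat/lon arithmetic can.
public class SphereVec {
    static double[] toVec(double latDeg, double lonDeg) {
        double lat = Math.toRadians(latDeg), lon = Math.toRadians(lonDeg);
        return new double[] { Math.cos(lat) * Math.cos(lon),
                              Math.cos(lat) * Math.sin(lon),
                              Math.sin(lat) };
    }

    /** great-circle angle in radians between two lat/lon points */
    static double angle(double lat1, double lon1, double lat2, double lon2) {
        double[] a = toVec(lat1, lon1), b = toVec(lat2, lon2);
        double[] cross = { a[1] * b[2] - a[2] * b[1],
                           a[2] * b[0] - a[0] * b[2],
                           a[0] * b[1] - a[1] * b[0] };
        double crossNorm = Math.sqrt(cross[0] * cross[0] + cross[1] * cross[1]
                                   + cross[2] * cross[2]);
        double dot = a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
        return Math.atan2(crossNorm, dot);  // multiply by earth radius for distance
    }
}
```

For example, two equatorial points straddling the dateline at lon 179.5 and -179.5 come out exactly 1 degree apart, with no special-case wrap-around code.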

          Bill Bell added a comment -

          Lance, I don't understand this.

          How do you change the frame of reference for geohash? Is it in the conversion of geohash to Lat/Long? Do you have an example of doing this?

          http://en.wikipedia.org/wiki/Geohash and http://www.synchrosinteractive.com/blog/1-software/38-geohash
          For demo: http://openlocation.org/geohash/geohash-js/

          The bigger issue for most people will be to sort by distance... I think we should focus on that one.
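On the conversion question above: a geohash is just interleaved longitude/latitude range-halving bits packed five at a time into base-32 characters, so geohash-to-lat/lon conversion is deterministic bit unpacking with no frame of reference involved. A minimal encoder sketch of the standard algorithm (not the patch's GeoHashUtils; the class name is illustrative):

```java
// Minimal standard geohash encoder: interleaves lon/lat bisection bits
// (longitude first) and packs them 5 at a time into base-32 characters.
// Decoding reverses the process, yielding a lat/lon cell, not a point.
public class GeohashSketch {
    static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lon, int length) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder sb = new StringBuilder(length);
        boolean evenBit = true;  // even bit positions refine longitude
        int bit = 0, ch = 0;
        while (sb.length() < length) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
                else            { ch = ch << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
                else            { ch = ch << 1;       latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { sb.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
        }
        return sb.toString();
    }
}
```

Each added character narrows the cell by another 5 bits, which is exactly the 4x8 / 8x4 grid subdivision described in this issue.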

          Lance Norskog added a comment -

          I have yet to make the polygon work at the poles or dateline. I nearly got a headache trying to figure out how to make that work

          I believe we discussed this before. Rotate the coordinates out to sea:

          https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=12921761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12921761

          Happy Imbolc!

          Bill Bell added a comment -

Instead of geodist(), would you create a new one, or extend ghhsin(radius, hash1, hash2)? I am not having a good time trying to get a list of values from a multiValued field using a function like geodist()... I have another ticket open on trying to get that to work.

          It might be easier to do a simple field like: latlongCombined=43,-92;42,-91;44,-98 and then just getting distance on each of these to sort and get a distance returned (get closest on asc, farthest on desc) on each latlongCombined.

          Thoughts?

          David Smiley added a comment -

Hi Bill. I'm at O'Reilly Strata 2011 this week and so I have limited ability to help you until next Monday. My code so far is purely for filtering, not sorting/ranking. That's a TODO item; it wasn't a requirement for my geospatial app so far. In the meantime, limit your use to a filter query using any of geofilt, bbox, or my query parser.

          Bill Bell added a comment -

          It might make sense to create a geohashdist() since geodist() only works on MultiValueSources not GeoHash.

          Bill Bell added a comment -

          OK. A couple things. If we know which lat,long was used for the result, that could be used for the geodist(). There seems to be an issue when you use geodist($pt,$sfield) when $sfield is a geohash.

          The most important is the sort parameter. It is easy enough to calculate the distance.

          sort=geodist($pt, $sfield)+asc needs to work. For example...

data for a document has:
store=45.15,-100.85
store=45.15,-93.85

1. If we set pt=45,-93.1, it should use the distance to 45.15,-93.85 in the sort by geodist()
2. If we set pt=45.15,-100.1, it should use the distance to 45.15,-100.85 in the sort by geodist()

You could do this by changing geodist() to work with geohash, finding the closest point in the list and returning its distance.

          Bill Bell added a comment -

          Here are some examples:

          http://10.0.1.83:8983/solr/select?q=*:*&sfield=store&pt=45.15,-93.85&fq={!geofilt}&d=10
          This works
          
          http://10.0.1.83:8983/solr/select?q={!func}geodist%28%29&sfield=store&pt=45.15,-93.85&sort=score%20asc
            org.apache.lucene.queryParser.ParseException: Spatial field must implement MultiValueSource:store{type=geohash,properties=indexed,stored,omitTermFreqAndPositions,multiValued}
          
          http://10.0.1.83:8983/solr/select?q={!func}geodist%28%29&sfield=store&pt=45.15,-93.85&sort=score%20asc
            org.apache.lucene.queryParser.ParseException: Spatial field must implement MultiValueSource:store{type=geohash,properties=indexed,stored,omitTermFreqAndPositions,multiValued}
          
          http://10.0.1.83:8983/solr/select?q=*:*&sfield=store&pt=45.15,-93.85&fq={!geofilt}&d=10&sort=geodist%28%29%20asc
              Can't determine Sort Order: 'geodist() asc', pos=7
          

          Bill

          Bill Bell added a comment -

          OK I got it to compile... Now getting this error:

org.apache.lucene.queryParser.ParseException: Spatial field must implement MultiValueSource:store{type=geohash,properties=indexed,stored,omitTermFreqAndPositions,multiValued}
          Bill Bell added a comment -

Will this version work?

          svn up -r 1052926 dev/trunk

          Bill Bell added a comment -

          I noticed these are not in source control:

          svn status --verbose solr/src/test-files/solr/conf/solrconfig.xml
          svn status --verbose solr/src/test-files/solr/conf/schema.xml

          Bill Bell added a comment - - edited

          On trunk I applied the patch and get an error. What revision of SOLR should I get to apply this patch?

          [javac] Compiling 24 source files to /home/solr/src/dev/trunk/lucene/build/contrib/spatial/classes/java
          [javac] /home/solr/src/dev/trunk/lucene/contrib/spatial/src/java/org/apache/lucene/spatial/geohash/GeoHashPrefixFilter.java:41: org.apache.lucene.spatial.geohash.GeoHashPrefixFilter is not abstract and does not override abstract method getDocIdSet(org.apache.lucene.index.IndexReader.AtomicReaderContext) in org.apache.lucene.search.Filter
          [javac] public class GeoHashPrefixFilter extends Filter {
          [javac] ^
          [javac] /home/solr/src/dev/trunk/lucene/contrib/spatial/src/java/org/apache/lucene/spatial/geohash/GeoHashPrefixFilter.java:54: method does not override or implement a method from a supertype
          [javac] @Override
          [javac] ^
          [javac] Note: /home/solr/src/dev/trunk/lucene/contrib/spatial/src/java/org/apache/lucene/spatial/tier/CartesianPolyFilterBuilder.java uses or overrides a deprecated API.
          [javac] Note: Recompile with -Xlint:deprecation for details.
          [javac] 2 errors

          Ideas?

          David Smiley added a comment -

          Hi Bill.

          You simply throw this in your solrconfig.xml:

          <queryParser name="geo" class="solr.SpatialGeohashFilterQParser$Plugin" />
          

          The parameters are different than the SpatialFilterQParser that's in trunk already. Take a look at the javadocs I wrote.
          By the way, there's a stupid bug in SpatialGeohashFilterQParser.parseBox() I found an hour ago; the array indexes should go 0,1,2,3 not 0,1,3,4. I have test code from my project I forgot to port to this patch that would have caught that.

I have yet to make the polygon work at the poles or dateline. I nearly got a headache trying to figure out how to make that work. In the app that I'm doing this for, our map viewing window doesn't wrap around at those boundaries and so it hasn't been an issue. If you have any suggestions or pointers on how to approach the problem then I'm interested.

          Bill Bell added a comment -

          David,

How do you configure SpatialGeohashFilterQParser?

Thanks for the efforts. Will the polygon search work at the North Pole and the dateline?

          David Smiley added a comment -

          Here is another patch. By the way, I'm using revision 1055285 of trunk.

          • Removed @author tags.
• Introduced a constant threshold, GRIDLEN_SCAN_THRESHOLD, at which a term scan is done instead of divide & conquer. It used to be 2, meaning that if maxlen is 9, then once we get to grid level 7 the remaining leaves are scanned manually instead of making more boxes. I should make this configurable, but it is not at this time.
          • By setting GRIDLEN_SCAN_THRESHOLD to 4, I found the performance to be superior for the geonames data when the query shape was more complex than a bbox. I haven't truly tuned this though.
          • Added polygon search based on JTS that will handle any "WKT" (well known text) query string! The JTS library (LGPL licensed) is downloaded similarly to how the "bdb" contrib module downloads sleepycat. The only limitation with this is that I don't do any special world boundary processing, which mainly matters at the dateline. That's a TODO.
• Added SpatialGeohashFilterQParser. I don't like SpatialFilterQParser. This one handles point-radius, bounding box, polygon, and WKT geometry inputs. The arguments and inputs were developed to be easily compatible with the geo extension to the OpenSearch spec. If JTS is not on the classpath then this query parser should still work, provided you don't do polygon or WKT (not verified, but it should work in theory).
          • Added a test for doing a polygon search. And I made the existing lat-lon test get executed for both geohash and latlon type.
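The threshold logic in the first bullet can be modeled abstractly: descend the prefix tree, prune prefixes with no indexed terms or no overlap with the query shape, accept whole subtrees the shape fully contains, and within GRIDLEN_SCAN_THRESHOLD of the maximum length switch to scanning individual indexed points. A toy sketch, using a sorted string set to stand in for Lucene's terms index; all names besides GRIDLEN_SCAN_THRESHOLD are illustrative, not the patch's API:

```java
import java.util.*;

// Toy model of the GeoHashPrefixFilter recursion over a sorted "terms index".
public class PrefixFilterSketch {
    static final String ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz";
    static final int GRIDLEN_SCAN_THRESHOLD = 4;  // per the patch description

    enum Relation { DISJOINT, CONTAINS, INTERSECTS }

    interface Shape {
        Relation relate(String cellPrefix);  // grid cell vs. query shape
        boolean contains(String fullHash);   // exact per-point test
    }

    static Set<String> filter(NavigableSet<String> index, Shape shape, int maxLen) {
        Set<String> matches = new TreeSet<>();
        descend(index, shape, "", maxLen, matches);
        return matches;
    }

    private static void descend(NavigableSet<String> index, Shape shape,
                                String prefix, int maxLen, Set<String> out) {
        // Prune: no indexed term under this prefix (seek() in the real filter).
        String ceil = index.ceiling(prefix);
        if (ceil == null || !ceil.startsWith(prefix)) return;
        switch (shape.relate(prefix)) {
            case DISJOINT:
                return;
            case CONTAINS:                    // grid box fully inside the shape
                out.addAll(subTerms(index, prefix));
                return;
            case INTERSECTS:
                if (prefix.length() >= maxLen - GRIDLEN_SCAN_THRESHOLD) {
                    // Scan the leaves under this prefix instead of subdividing.
                    for (String h : subTerms(index, prefix))
                        if (shape.contains(h)) out.add(h);
                } else {
                    for (char c : ALPHABET.toCharArray())
                        descend(index, shape, prefix + c, maxLen, out);
                }
        }
    }

    private static SortedSet<String> subTerms(NavigableSet<String> index, String prefix) {
        return index.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }
}
```

The CONTAINS branch is the "one fell swoop" case: an entirely-covered grid box contributes all of its points without any per-point shape tests.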

          Here is an updated benchmark. I'm doing geohash of length 9 and this time with the threshold mentioned above at 4. The query is a circle (no bbox). This triggers the LatLonType field to do a completely different algorithm in which it loads every value into memory via the field cache and does a brute force match. This GeoHash prefix filter has never used the field cache! It uses Lucene's index. The "places/query" (which is an average) actually varied by one between both implementations. Could indicate a bug or some math rounding issue at the edge. And another point is that these benchmarks almost certainly resulted in my OS disk cache putting the relevant index files into memory.

km     places/query   ms/query (LatLon)   ms/query (geohash)
11     587            10.0                4.8
44     3,404          11.5                4.3
230    45,536         21.8                24.0
1800   1,319,692      288.5               142.3

          I'm pretty happy with it at this point and I'll sit on it for a while, gathering feedback.

          David Smiley added a comment -

Thanks for the "Damn Cool Algorithms" article! It's fantastic; I'm in geospatial geek heaven. I didn't know what quadtrees were before, but now I know – it's just a 4-ary grid instead of the 32-ary grid that geohash uses. Nick's illustration http://static.notdot.net/uploads/geohash-query.png should be useful for anyone coming to my code here to see how it works.

The point where the article diverges from what I'm doing is when he attempts to limit the number of "ranges" (i.e. boxes), 5 in his example. My first geohash patch had logic akin to this in which I picked a resolution to intersect with the query shape that limited the number of boxes. I'd seek to the next box I needed and then iterate over the indexed terms at that box, testing to see if it's in the query. It could potentially look at way more points than needed. Now that I'm indexing a token for each geohash precision, I found it straightforward to implement a fully recursive algorithm down to the bottom grid (or one level higher than that, anyway). If there are no points in a given area then it's short-circuited. The worst case is when much of the edge of the shape passes through densely populated points. At some point there's a trade-off in which you pick between evaluating each point in the current box against the queried shape versus divide & conquer. My code here is making that decision simply by a geohash length threshold, but I have some comments in there about making estimations for certain usage scenarios (e.g. a one-to-one relationship between points and documents) and some sort of cost model for the query-shape complexity.

          Hilbert Curves are interesting. Applying that to my code would improve box adjacency which will reduce the number of seek() calls, which I believe is one of the more expensive operations.

          I've thought about indexing arbitrary shapes instead of being limited to points. An indexed shape could be put into the payload of the MBR (minimum bounding rectangle) of the grid box term covering that shape – potentially duplicating it twice or worst case four times depending on the ratio of its size to intersecting grid boxes. At query time, the recursive algorithm here would examine the payloads to perform a shape intersection. Not too hard. I assume you are familiar with SOLR-2268 ? It seems Grant needs convincing to use JTS due to the LGPL license.

          Another thing I've been thinking about, by the way, is applying an "equal area projection" like http://en.wikipedia.org/wiki/Gall%E2%80%93Peters_projection Using this would enable you to get by with a smaller geohash length to meet a certain uniform accuracy since you aren't "wasting" bits, as we are now, at the poles. I have yet to calculate the savings, and it would add some computational cost at indexing time, but not really at query time.

          Ryan McKinley added a comment -

          David – this is looking good!

          I finally have some time to focus on spatial again – and will be diving in soon. In particular, I need to work on good ways to index polygonal data - not just points.

A quick skim of the patch... I like the GridReferenceSystem, and think it could perhaps be more general than geohash. I will be looking at quadtree-ish options, perhaps Hilbert curve options:
          http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves

Minor nit... we will need to remove any @author tags before stuff gets committed.

+1 for JTS; it really is the way to go for polygonal stuff.

          David Smiley added a comment -

          Attached is my latest patch for geohash prefix based geospatial search. This patch is a performance-centric update. GeoHashes are still used, but I'm indexing a token for every geohash length per point indexed. So a geohash length of 9 results in 9 tokens. This solves performance issues when a huge number of points were matching a query. See the below table:

km     places/query   ms/query (LatLon)   ms/query (geohash)
11     692            3.8                 5.0
44     4,043          4.8                 6.0
230    57,200         15.0                17.5
1800   1,405,767      94.0                71.0

          The LatLon is using a pair of trie doubles at a precisionStep of 8. I tried 6 & 16 but 8 was about right. The GeoHash length (a new configurable option) was chosen to be 9 which has plenty of precision for most uses (I recall it's a couple meters or less; I forget). The queries are bounding lat-lon boxes.

          What isn't in this performance table is the impact of this new algorithm on more complicated spatial queries. It's superior to the algorithm that existed before it and it should also be superior to LatLonType. Grid boxes that are completely within the query shape get efficiently added in one fell swoop.
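The one-token-per-length scheme described above means a single stored point expands into its full chain of prefixes at index time, so grid cells at every level are directly seekable terms. A trivial sketch of that expansion (the class name is illustrative, not from the patch):

```java
import java.util.ArrayList;
import java.util.List;

// Emit one token per geohash precision: a length-9 hash yields 9 tokens,
// one for each prefix, so every grid level is its own indexed term.
public class PrefixTokens {
    static List<String> tokens(String geohash) {
        List<String> out = new ArrayList<>(geohash.length());
        for (int len = 1; len <= geohash.length(); len++)
            out.add(geohash.substring(0, len));
        return out;
    }
}
```

The space cost is modest (the prefixes are short and compress well in the terms dictionary), and in exchange a query can match a coarse cell with a single term lookup instead of enumerating every full-precision hash beneath it.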

          Code details:

          • Most of the former patch, which included a lot of additions to GeoHashUtils is no longer present in this new patch. This was basically a rewrite.
          • I abstracted use of GeoHashUtils to GridNode.GridReferenceSystem class so that in the future I can tinker with alternate more efficient encodings without breaking any code here.
          • I needed a shape interface or abstract class and so I decided to embrace & extend org.apache.lucene.spatial.geometry.shape.Geometry2D instead of having my own like I did before. I added PointDistanceGeom & MultiGeom.
          • There is an extensive random data filter test in SpatialFilterTest that I added. It's hard to follow but it teased out a few bugs.

          Next patch real soon:

          • I'm going to modify the build.xml to grab the LGPL licensed JTS library which has well-tested & high performance geometry code. In particular, I'll use it to implement a polygon shape. (Already done in another codebase; just needs to be ported to this patch)
          • I'm going to include an alternative query parser to what comes with Solr. This one will do all of point-distance, lat-lon box, and polygon. (Already done in another codebase; just needs to be ported to this patch).

          Future:

          • Replace geohash with something more efficient. Some basic testing suggests to me I could double-or-better the performance.
          • Compatibility with distance sorting / relevancy boosting when not multi-valued.

          I'd really like input from other geospatial birds-of-a-feather in Solr, especially committers.

          As an aside, MongoDB has chosen a similar algorithm.

          David Smiley added a comment -

          Bill, you can find examples here: http://wiki.apache.org/solr/SpatialSearch In particular, look for the filter queries involving geofilt or bbox.
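For instance, the wiki's filter queries look like the following (the field name `store` and the point/distance values are placeholders):

```
&fq={!geofilt sfield=store pt=45.15,-93.85 d=5}    (point-distance filter, d in km)
&fq={!bbox sfield=store pt=45.15,-93.85 d=5}       (bounding box around that circle)
```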

          Bill Bell added a comment -

          David can you show a sample query? Poly, point, etc. Thanks

          Bill Bell
          Sent from mobile

          David Smiley added a comment -

          For evaluating the performance of geospatial search, I've contributed a patch: LUCENE-2844 "benchmark geospatial performance based on geonames.org"

          David Smiley added a comment - - edited

          Hi Bill. I think you are referring to Grant Ingersoll. Yes, this patch was developed to be compatible with his efforts (which are not his alone). They are applied "on top" of the existing work, so to speak, enhancing the performance of the Lucene query returned by solr.GeoHashField by using a completely different algorithm. As such, you can use the existing query syntax documented on the wiki http://wiki.apache.org/solr/SpatialSearch#bbox_-_Bounding-box_filter. Simply apply this patch to trunk, and use the solr.GeoHashField type to store your "lat,lon" formatted data just like an example might show for solr.LatLonType. Of course, with this field type, you can supply as many values as you want. You needn't know anything about geohashes.

          This work is for GeoHashField, not for LatLonType. These both have a similar feel for the user, in which you provide a "lat,lon". The difference is in the implementation. solr.PointType is non-geospatial--it's for a cartesian (i.e. x,y) plane.

          An update on my work lately:

          Since the posting of this patch in October, I've created my own query parser. It handles point-distance, a specific lat-lon box (instead of being forced to pick the box encompassing a point-distance, which is what the bbox query does), and point-in-polygon searches. The polygon support required more than query parsing, of course, but figuring out how to actually implement it. I'm using the JTS library (LGPL open-source).

          I have been evaluating the performance of my patch compared to LatLonType. LatLonType scales to match basically any number of points matched due to the underlying Trie based numeric fields introduced in solr v1.4. My geohash field doesn't scale as well... so if your query matches, say, less than 100k distinct points then you'll meet or exceed LatLonType's performance (in the default configuration). However it's linear from there on out: 1000k (1M) points takes 10x as long as 100k whereas for LatLonType it's something like 1.5x as long. I've been thinking about some ideas to scale better and I hope to try it this week. If you don't expect a query to match more than ~100k points then the current patch will serve you well.

          Bill Bell added a comment -

          Is this compatible with Grant Inverness's work? If so we should get it committed to the trunk so that we can try it out on various points (Poles, etc).

          Will this work with GeoHashField type only? Or does it work on all 3 types?

          solr.PointType
          solr.LatLonType
          solr.GeoHashField

          David, can you add some simple user docs (how to set it up and an example config)?

          Thanks.

          Lance Norskog added a comment -

          I'm not sure how a "trie version" of geohash is developed. I already spoke of further refining the implementation to index the geohashes at each grid level and I think that is very similar to what trie does for numbers.

          This is why I mention the Trie classes- they seemed like the same tool, and the lessons learned in making facets etc. work are worth knowing. It seems like a Trie for geohash would just be a character-by-character Trie.

          David Smiley added a comment -

          Using the canonical geohash gives facet values that can be copy&pasted with other software. Thinking again, this is a great feature. Would it be worth optimizing geohash with a Trie version? Trie fields (can be made to) show up correctly in facets.

          The geohash usage is purely internal to the implementation; users don't see it when they use this field. And even if they were exposed, they can be generated on-demand. There's even javascript code I've seen to do this. So I'm not married to using geohashes – it's the underlying hierarchical/gridded nature of them that is key. I'm not sure how a "trie version" of geohash is developed. I already spoke of further refining the implementation to index the geohashes at each grid level and I think that is very similar to what trie does for numbers.

          Thanks for the suggestion of using OpenStreetMaps to get locations; I'll look into that. I want to put together a useful data set – using real data as much as possible is good. I'll need to synthesize a one-to-many document to points mapping, randomly, however. And I'll need to come up with various random lat-lon box queries to perform. I'd like to use Lucene's benchmark contrib module as a framework to develop the performance test. I read about it in LIA2 and it seems to fit the bill.

          Lance Norskog added a comment - - edited

          I've reread the patch a few times and I understand it better now, and yes there should be no equator/prime meridian problems. I retract any overt or implied criticism.

          First of all, I'm re-using the existing geohash field support in Solr which indexes the lat-lons as actual geohashes (i.e. the character representation), not in a bitwise fashion. But that doesn't really matter - it would be a worthwhile optimization to index them in that fashion as it would be more compact.

          Using the canonical geohash gives facet values that can be copy&pasted with other software. Thinking again, this is a great feature. Would it be worth optimizing geohash with a Trie version? Trie fields (can be made to) show up correctly in facets.

          And thank you for the word gazetteer.

          About unit tests: I've stumbled so many times with floating point that I only trust real-world data. A good unit test would be indexing a gazetteer of world data and randomly comparing points. OpenStreetMaps or Wikipedia locations, for example.

          David Smiley added a comment -

          Lance, did you look at my patch or just skim the issue description? There seems to be a big disconnect in what I'm doing and what you think I'm doing.

          First of all, I'm re-using the existing geohash field support in Solr which indexes the lat-lons as actual geohashes (i.e. the character representation), not in a bitwise fashion. But that doesn't really matter – it would be a worthwhile optimization to index them in that fashion as it would be more compact.

          Secondly, as stated in the issue description, this filter finds multiple geohash grid squares to cover the queried area. It doesn't matter where the boundaries are – 0, 90, 180, Unalaska, the North Pole, whatever. A search over the London or Greenwich areas will yield grid squares that have no prefixes in common, but that doesn't matter; each grid square is subsequently searched independently against the user's query.

          Thirdly, the only distance measurements in this patch compare resolved latitude-longitude points (e.g. decoded geohashes) in the index against the user's query, and only when that query is point-radius (vs a lat-lon bounding box, which need not calculate distance). This uses haversine, but I'm not using geohashes for distance calculation.
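For concreteness, the haversine distance used for that point-radius check is the standard great-circle formula. A minimal sketch, assuming a spherical earth (not code from the patch):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points, in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))
```

A point-radius filter then just keeps documents whose decoded point satisfies `haversine_km(qlat, qlon, lat, lon) <= d`.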

          If you still insist there is a shortcoming of my implementation, then I challenge you to add a unit test proving that my implementation here doesn't work. The existing tests I used in unit

          Lance Norskog added a comment -

          You're right, this had no context!

          From the geohash site: Geohashes offer properties like arbitrary precision, similar prefixes for nearby positions, and the possibility of gradually removing characters from the end of the code to reduce its size (and gradually lose precision).

          If you store geohashes in a bitwise format, you get the "N leading bits" trick: the Manhattan distance between any two hashes is the length of the first N matching bits. The more matching bits starting from the highest or "hemisphere" bit, the closer two points are.

          You can use this to search bounding boxes to a given Level Of Detail (LOD) by only comparing the first N bits (The LOD is of course a power of 2).

          The core problem with this bitwise comparison trick is that one zero crossing is in Greenwich, in Greater London. The other is at the equator. So this bitwise search trick works in most of the world, just not in London or at the Equator. Street mapping and "find the nearest X" are major use cases for geo-search. So, we have an ultra-fast bounding box search that blows up in London. (Of course, not just London; everything at Longitude 0.00.)

          The longitude above goes through Unalaska in an area with no roads, giving a zero crossing that blows up in a sparsely inhabited area. Then, instead of the Equator, use the North Pole as the zero crossing. The longitude passes through the island where there are no roads, and there are no streets (yet) at the North Pole. Street mapping applications would work perfectly well with a Rotated Geohash. Thus, rotating the geohash gives a variable-LOD bitwise search that always works and is very very fast.

          Super-fast Manhattan distance search may not be an interesting goal any more, since CPUs are so fast. So, rotating the basis of the Geohash is probably not worthwhile. Also, it would generate loads of confused traffic on solr-user.

          Does this help?

          David Smiley added a comment -

          The presentation I gave at Lucene Revolution on this subject was finally uploaded: http://www.lucidimagination.com/files/Lucene%20Rev%20Preso%20Smiley%20Spatial%20Search.pdf

          David Smiley added a comment -

          Lance, sorry, you've completely lost me; I don't understand anything you've said. Can you please try to explain your points in more detail, any of them at least?

          I don't see what's so special about this point on the earth you've drawn attention to:
          65°37′21″N 168°20′42″W geohash: b7b01fqvuff1
          No point or location in geohash is special or problematic. There are gridlines at every resolution which need to be dealt with – and it's not hard.

          Lance Norskog added a comment - - edited

          The problem with Geohash is that it puts zero crossings in Greater London and at the Equator, so every computation that uses it has to dodge these points. More to the point, the Hamming distance trick does not work there, so a simple super-fast scan of an array of Lucene Trie-Integers does not work.

          On the Aleutian island of Unalaska, there is a longitude which goes through a mountainous region with no roads. The longitude does not touch any other land north of Antarctica.

          Unalaska

          65°37′21″N 168°20′42″W

          If you rotate the geohash frame of reference to latitude North Pole and a longitude through Unalaska, you can cheerfully ignore all of the zero points, at the cost of alienating some Inuit.

          Seriously, if you limit Geohash accuracy to land-based services (like maps), with an accuracy warning about comparing Nome and Vladivostok, sacrificing a road-less part of an Aleutian island seems a small price to pay.

          Robert Muir added a comment -

          One area that I know nothing about is how scoring/sorting actually works within Lucene. For the work here I wasn't in need of that but many people clearly want that. In your opinion Rob, is there any opportunity for geo sorting/relevancy code to take advantage of any efficiencies done here or are they completely unrelated?

          That's a good question; I'm not sure this stuff will help with that. But there are also a lot of people who, like you, don't need/want to integrate it into scoring and maybe just want to filter on distance really fast, and score based on something else.

          I'm not a spatial guy and don't understand the spatial goings-on, but it seems like maybe people who want to do relevance based on distance could achieve that some other way and use the trie value to just have a really fast bounding "box" filter.

          Maybe they use a solr function query or however this is done, but the filter would speed it up tremendously, of course with some loss of precision. But this is search; it's not like the textual component has perfect precision, and a lot of people aren't going across meridians or the earth's poles or anything.

          David Smiley added a comment -

          Yes, absolutely Rob. I went with geohashes because it was a straight-forward implementation to prove out the concept. It appears my patch is the first of its kind for Lucene/Solr. For doing a more efficient Morton representation, I have already looked at the work going on at javageomodel: http://code.google.com/p/javageomodel/ which was built for use with Google BigTable. The code there is largely pure java, keep in mind. It's the same concept but it uses a dictionary of size 16 (representable by 4 bits) which results in cleaner algorithms than geohashes' 5-bit dictionary which has some even/odd rules to it which are awkward. But yes, it would be more efficient to store the actual intended bits, not characters.

          One area that I know nothing about is how scoring/sorting actually works within Lucene. For the work here I wasn't in need of that but many people clearly want that. In your opinion Rob, is there any opportunity for geo sorting/relevancy code to take advantage of any efficiencies done here or are they completely unrelated?

          (I meant to track you down at LuceneRevolution to say hi but I missed the opportunity.)

          Robert Muir added a comment -

          Since a geohash is a textual representation of a Morton number (interleaving bits), wouldn't it be better to use a Morton number (numeric representation) so that NumericRangeQuery/Filter could be used instead of PrefixQuery or TermRangeQuery?

          It would be the same results, only faster, as it would need to visit less terms at search-time.

          David Smiley added a comment -

          My next step is to measure performance, perhaps using Lucene's benchmark contrib module with some as-yet-unidentified source of lat-lons; I can then do some tuning. I've identified several areas for improvement that I will tackle, most notably indexing geohashes at all character lengths, which enables an algorithm that can do faster queries covering many points. I've heard others call this "geospatial tiers," which is in effect what I'll be doing. I'll also add a PolygonGeoShape.

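          The multi-length indexing idea above can be sketched as: encode each point to full precision, then index every prefix of that geohash as its own term, so a coarse grid square in the query matches one short term instead of a scan over many long ones. A hypothetical, self-contained sketch (the encoder is the standard base-32 geohash algorithm; class and method names are mine, not from the patch):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of indexing a point's geohash at every prefix length,
// so a coarse grid square matches a single short term. The encoder below is
// the standard base-32 geohash algorithm (alternate bisection of lon/lat).
public class GeoHashPrefixSketch {
    static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Standard geohash encoding: alternately bisect longitude and latitude,
    // emitting one base-32 character per 5 bits.
    static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder sb = new StringBuilder();
        boolean bisectLon = true; // geohash starts with a longitude bit
        int bits = 0, ch = 0;
        while (sb.length() < precision) {
            if (bisectLon) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; }
                else            { ch = ch << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; }
                else            { ch = ch << 1;       latMax = mid; }
            }
            bisectLon = !bisectLon;
            if (++bits == 5) { sb.append(BASE32.charAt(ch)); bits = 0; ch = 0; }
        }
        return sb.toString();
    }

    // Every prefix length of the full-precision hash; each would be indexed
    // as a separate term for the document.
    static List<String> termsToIndex(double lat, double lon, int precision) {
        String full = encode(lat, lon, precision);
        List<String> terms = new ArrayList<>();
        for (int i = 1; i <= full.length(); i++) terms.add(full.substring(0, i));
        return terms;
    }
}
```

          At query time, a filter covering a region with a few coarse grid squares can then match each square with one term, descending to finer prefixes only along the region's edges — which is the "geospatial tiers" effect described above.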
          David Smiley added a comment -

          The attached patch is tested at both the GeoHashUtils level and via SpatialFilterTest; I added tests to both. I also added ASF headers.


            People

            • Assignee: David Smiley
            • Reporter: David Smiley
            • Votes: 21
            • Watchers: 27