Lucene - Core / LUCENE-2844

benchmark geospatial performance based on geonames.org

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      See comments for details.
      In particular, the original patch "benchmark-geo.patch" is fairly different from LUCENE-2844.patch.

      1. LUCENE-2844_spatial_benchmark.patch
        23 kB
        David Smiley
      2. LUCENE-2844_spatial_benchmark.patch
        27 kB
        David Smiley
      3. benchmark-geo.patch
        25 kB
        David Smiley
      4. benchmark-geo.patch
        27 kB
        David Smiley

          Activity

          ASF subversion and git services added a comment -

          Commit 1536181 from David Smiley in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1536181 ]

          LUCENE-2844: spatial benchmark

          ASF subversion and git services added a comment -

          Commit 1536180 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1536180 ]

          LUCENE-2844: spatial benchmark

          ASF subversion and git services added a comment -

          Commit 1536178 from David Smiley in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1536178 ]

          LUCENE-2844: fix java 7 <>

          ASF subversion and git services added a comment -

          Commit 1536177 from David Smiley in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1536177 ]

          LUCENE-2844: spatial benchmark

          ASF subversion and git services added a comment -

          Commit 1536176 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1536176 ]

          LUCENE-2844: spatial benchmark

          David Smiley added a comment -

          The attached patch adds documentation. I chose to leave the extensive option list in spatial.alg rather than duplicating it elsewhere; the other documentation points to spatial.alg for the listing.

          I put a compressed allCountries.txt up on people.apache.org; it is a copy of the geonames file with its lines in randomized order. The benchmark fetches this copy instead of the live one so that test results are reproducible.
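
          The randomized-but-reproducible line order described above can be sketched as follows. This is a minimal illustration, not the script actually used to prepare the hosted file: the class name, method name and seed are hypothetical, and it assumes the lines fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: shuffles lines with a fixed seed so every run
// produces the same "random" order.
class ShuffleLines {

    /** Returns a shuffled copy of the lines; the order depends only on the seed. */
    static List<String> shuffled(List<String> lines, long seed) {
        List<String> copy = new ArrayList<>(lines); // leave the input untouched
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        // Same seed, same order: this is what makes benchmark runs comparable.
        System.out.println(shuffled(lines, 42L).equals(shuffled(lines, 42L)));
    }
}
```

          Shuffling once and hosting the result, rather than shuffling at benchmark time, means every user indexes the documents in the same order.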

          I made various other fairly minor improvements too. Notably, if another SpatialStrategy implementation needs to be tested, it should be feasible to do so by extending SpatialDocMaker without duplicating much code.

          I intend to commit this in a couple days or so.

          David Smiley added a comment -

          I completely re-did this with a summer intern, Liviy Ambrose. It's similar to the first approach but simpler, and it isn't based on it. Unlike the first patch, it does not modify any of the existing benchmark code (aside from the build.xml of course). I intend to enhance the benchmark code under separate issues, so that this patch can focus on just spatial benchmarking.

          Test data

          The build.xml grabs a tab-separated values file from geonames.org, which contains millions of latitude & longitude based points. I want to take a snapshot (for reproducible tests), randomize the line order, and put it on http://people.apache.org/~dsmiley/. Additionally, Spatial4j's tests have a file containing a WKT-formatted polygon for many countries. I want to host that as well in a format readable by LineDocSource.

          Source files (only 3):

          • GeonamesLineParser.java: This is designed for use with LineDocSource. Geonames.org data comes in a tab-separated value file.
          • SpatialDocMaker.java: This class is key.
            • It holds a reference to the Lucene SpatialStrategy which it configures from the algorithm file, mostly via factories. It's possible to test quite a variety of spatial configurations, although it does assume RecursivePrefixTree.
            • This DocMaker has the specialization to convert the shape-formatted string in the body field to a Shape object to be indexed. It also has a configurable ShapeConverter to optionally convert a point to a circle or bounding box.
          • SpatialFileQueryMaker.java: Instead of hard-coded queries (as seen in other non-spatial tests), it configures a private LineDocSource instance and it reads the shapes off that to use as spatial queries. For now you'd use it with GeonamesLineParser. Furthermore, it re-uses SpatialDocMaker's ShapeConverter so that the points can then become circle or rectangle queries.
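
          As a rough illustration of the parsing step, the sketch below turns one tab-separated geonames line into a shape string. It is not the code from the patch: the class and method names are hypothetical, and emitting a WKT POINT is an assumption about the shape format. The column positions (name at index 1, latitude at 4, longitude at 5) follow the geonames.org dump layout, and the sample line is made up.

```java
// Hypothetical stand-in for the parsing that a LineDocSource line parser would do.
class GeonamesPointSketch {
    // Column positions in the tab-separated geonames.org dump.
    static final int LAT_COL = 4;
    static final int LON_COL = 5;

    /** Converts one geonames line to a WKT point; WKT uses x y (lon lat) order. */
    static String toWktPoint(String line) {
        String[] cols = line.split("\t", -1); // -1 keeps empty trailing columns
        if (cols.length <= LON_COL) {
            throw new IllegalArgumentException("expected at least 6 tab-separated columns");
        }
        double lat = Double.parseDouble(cols[LAT_COL]);
        double lon = Double.parseDouble(cols[LON_COL]);
        return "POINT(" + lon + " " + lat + ")";
    }

    public static void main(String[] args) {
        // Made-up sample record, not real geonames data.
        String line = "1\tExample Place\tExample Place\t\t42.5\t1.5\tP\tPPL";
        System.out.println(toWktPoint(line)); // prints POINT(1.5 42.5)
    }
}
```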

          The provided spatial.alg shows how to use it.

          Notes:

          • The spatial data is placed into the "body" field of a standard benchmark DocData class as a string. Originally I experimented with a custom SpatialDocData, but I determined that was needless since the existing class works. And after all, if you're testing spatial, you don't need to be simultaneously testing text. I didn't put it in DocData's attached Properties instance because that seems kinda heavyweight, or at least medium-weight.

          The patch is not ready – I need to add documentation, pending input on this approach.

          David Smiley added a comment -

          benchmark-geo.patch (2011-01)

          Until now (with this patch), the benchmark contrib module did not include a means to test geospatial data. This patch includes some new files and changes to existing ones. Here is a summary of what is being added in this patch per file (all files below are within the benchmark contrib module) along with my notes:

          Changes:

          • build.xml – Add dependency on Lucene's spatial module and Solr.
          • ReadTask.java – Added a search.useHitTotal boolean option that will use the total hits number for reporting purposes, instead of the existing behavior.
            • The existing behavior (i.e. when search.useHitTotal=false) doesn't look very useful since the response integer is the sum of several things instead of just one thing. I don't see how anyone makes use of it.

          Note that on my local system, I also changed ReportTask & RepSelectByPrefTask to not include the '-' every other line, and also changed Format.java to not use commas in the numbers. These changes make copy-pasting into Excel more streamlined.

          New Files:

          • geoname-spatial.alg – my algorithm file.
            • Note the ":0" trailing the Populate sequence. This is a trick I use to skip building the index, since it takes a while to build and I'm not interested in benchmarking index construction. Set this to :1 for the first run to build the index, then put it back to :0 for further runs, as long as doc.geo.schemaField and any other configuration elements affecting the index stay the same.
            • In the patch, doc.geo.schemaField=geohash but unless you're tinkering with SOLR-2155, you'll probably want to set this to "latlon"
          • GeoNamesContentSource.java – a ContentSource for a geonames.org data file (either a single country like US.txt or allCountries.txt).
            • Uses a subclass of DocData to store all the fields. The existing DocData wasn't very applicable to data that is not composed of a title and body.
            • Doesn't reuse the docdata parameter to getNextDocData(); a new one is created every time.
            • Only supports content.source.forever=false
          • GeoNamesDocMaker.java – a subclass of DocMaker that works very differently than the existing DocMaker.
            • Instead of assuming that each line from geonames.org will correspond to one Lucene document, this implementation supports, via configuration, creating a variable number of documents, each with a variable number of points taken randomly from a GeoNamesContentSource.
            • doc.geo.docsToGenerate: The number of documents to generate. If blank it defaults to the number of rows in GeoNamesContentSource.
            • doc.geo.avgPlacesPerDoc: The average number of places to be added to a document. A random number between 0 and one less than twice this amount is chosen on a per document basis. If this is set to 1, then exactly one is always used. In order to support a value greater than 1, use the geohash field type and incorporate SOLR-2155 (geohash prefix technique).
            • doc.geo.oneDocPerPlace: Whether at most one document should use the same place. In other words, can more than one document have the same place? If so, set this to false.
            • doc.geo.schemaField: references a field name in schema.xml. The field should implement SpatialQueryable.
          • GeoPerfData.java: This class is a singleton storing data in memory that is shared by GeoNamesDocMaker.java and GeoQueryMaker.java.
            • content.geo.zeroPopSubst: if a population is encountered that is <= 0, then use this population value instead. Default is 100.
            • content.geo.maxPlaces: A limit on the number of rows read in from GeoNamesContentSource.java can be set here. Defaults to Integer.MAX_VALUE.
            • GeoPerfData is primarily responsible for reading in data from GeoNamesContentSource into memory to store the lat, lon, and population. When a random place is asked for, you get one weighted according to population. The idea is to skew the data towards more referenced places, and a population number is a decent way of doing it.
          • GeoQueryMaker.java – returns random queries from GeoPerfData by taking a random point and using a particular configured radius. A pure lat-lon bounding box query is ultimately done.
            • query.geo.radiuskm: The radius of the query in kilometers.
          • schema.xml – a Solr schema file to configure SpatialQueryable fields referenced by doc.geo.schemaField.
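
          To make the population weighting and the places-per-document count concrete, here is a minimal sketch. It is not the GeoPerfData or GeoNamesDocMaker code from the patch: the class and method names are hypothetical, the 0 to 2*avg-1 range is assumed to be inclusive, and a cumulative-sum table with binary search is just one common way to pick an item in proportion to its weight.

```java
import java.util.Random;

// Hypothetical sketch of population-weighted place selection.
class PopulationWeightedPicker {
    private final long[] cumulative; // cumulative[i] = sum of weights of places 0..i
    private final long total;

    /** Populations <= 0 are replaced by zeroPopSubst (cf. content.geo.zeroPopSubst). */
    PopulationWeightedPicker(long[] populations, long zeroPopSubst) {
        if (populations.length == 0 || zeroPopSubst <= 0) {
            throw new IllegalArgumentException("need at least one place and a positive substitute");
        }
        cumulative = new long[populations.length];
        long sum = 0;
        for (int i = 0; i < populations.length; i++) {
            sum += populations[i] <= 0 ? zeroPopSubst : populations[i];
            cumulative[i] = sum;
        }
        total = sum;
    }

    /** Maps a value in [0, total) to the place whose weight interval contains it. */
    int indexFor(long target) {
        int lo = 0, hi = cumulative.length - 1;
        while (lo < hi) { // find the first index with cumulative[index] > target
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] > target) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    /** Picks a place index with probability roughly proportional to its population. */
    int pick(Random random) {
        // floorMod keeps the value in [0, total); the tiny modulo bias is ignored here.
        return indexFor(Math.floorMod(random.nextLong(), total));
    }

    /** Places per document: exactly 1 if avg is 1, else uniform in 0 .. 2*avg-1. */
    static int placesForDoc(int avgPlacesPerDoc, Random random) {
        return avgPlacesPerDoc == 1 ? 1 : random.nextInt(2 * avgPlacesPerDoc);
    }

    public static void main(String[] args) {
        // Weights become 1000, 100 (substituted), 50.
        PopulationWeightedPicker picker =
            new PopulationWeightedPicker(new long[] {1000, 0, 50}, 100);
        Random random = new Random(7);
        System.out.println("picked place " + picker.pick(random)
            + ", places in doc: " + placesForDoc(3, random));
    }
}
```

          With this scheme a place with ten times the population is chosen about ten times as often, which is the skew towards more referenced places described above.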

          When I run this algorithm as provided with the file in the patch, I get this result:

          Operation   round ____km   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
          Search_40       0    350        1      4811687 1,206,541.38        3.99   117,722,664    191,934,464
          

          The key metrics I use are the average milliseconds per query and the average places per query. The number of queries performed is the trailing numeric suffix of the Operation name (40 for Search_40). The formulas:

          • avg ms/query: elapsedSec*1000/queries == 98.8
          • avg places / query: recsPerRun/queries == 120,292
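
          The two formulas can be applied to the report row mechanically, as in the sketch below. The class and method names are hypothetical. Note that plugging in the rounded table values (elapsedSec 3.99, 40 queries) gives roughly 99.75 ms/query rather than the 98.8 quoted above; that figure may come from a different run.

```java
// Hypothetical helper that derives the two key metrics from one report row.
class SearchMetrics {

    /** Parses the trailing numeric suffix of the operation name, e.g. "Search_40" -> 40. */
    static int queriesFromOperation(String operation) {
        return Integer.parseInt(operation.substring(operation.lastIndexOf('_') + 1));
    }

    /** avg ms/query = elapsedSec * 1000 / queries */
    static double avgMsPerQuery(double elapsedSec, int queries) {
        return elapsedSec * 1000.0 / queries;
    }

    /** avg places/query = recsPerRun / queries */
    static double avgPlacesPerQuery(long recsPerRun, int queries) {
        return (double) recsPerRun / queries;
    }

    public static void main(String[] args) {
        int queries = queriesFromOperation("Search_40");
        // Values taken from the report row above.
        System.out.printf("avg ms/query: %.2f%n", avgMsPerQuery(3.99, queries));
        System.out.printf("avg places/query: %.0f%n", avgPlacesPerQuery(4811687L, queries));
    }
}
```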

          A further note on the build.xml change: it was a real pain to figure out the convoluted ant build system to make this work, and I doubt I did it the proper way. Rob Muir thought it would be a good idea to make the benchmark contrib module a top-level module (i.e. alongside analysis) so that it can depend on everything (see http://lucene.472066.n3.nabble.com/Re-Geospatial-search-in-Lucene-Solr-tp2157146p2157824.html ). I agree.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          David Smiley added a comment -

          This is an update to the patch which accounts for the move of the benchmark contrib to /modules/benchmark. It also includes GeoNamesSetSolrAnalyzerTask, which will use Solr's field-specific analyzer. It's very much tied to this set of classes in the patch. There are ASF headers now too.

          Robert Muir added a comment -

          David, I'll first create an issue to propose moving benchmark/ to modules.

          I've personally been frustrated by this before (just simple stuff like wanting to benchmark some analysis definition in a schema.xml for ReadTokens/indexing speed and having to actually write an Analyzer.java to do it).


            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              1
              Watchers:
              3
