Lucene - Core
  LUCENE-2844

benchmark geospatial performance based on geonames.org

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      See comments for details.
      In particular, the original patch "benchmark-geo.patch" is fairly different from LUCENE-2844.patch.

      1. LUCENE-2844_spatial_benchmark.patch
        27 kB
        David Smiley
      2. LUCENE-2844_spatial_benchmark.patch
        23 kB
        David Smiley
      3. benchmark-geo.patch
        27 kB
        David Smiley
      4. benchmark-geo.patch
        25 kB
        David Smiley

        Issue Links

          Activity

          David Smiley created issue -
          David Smiley made changes -
          Field Original Value New Value
          Attachment benchmark-geo.patch [ 12467326 ]
          Robert Muir made changes -
          Link This issue depends on LUCENE-2845 [ LUCENE-2845 ]
          David Smiley made changes -
          Attachment benchmark-geo.patch [ 12468689 ]
          Mark Thomas made changes -
          Workflow jira [ 12541406 ] Default workflow, editable Closed status [ 12563922 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12563922 ] jira [ 12585415 ]
          Robert Muir made changes -
          Fix Version/s 4.1 [ 12321140 ]
          Fix Version/s 4.0 [ 12314025 ]
          Mark Miller made changes -
          Fix Version/s 5.0 [ 12321663 ]
          Mark Miller made changes -
          Fix Version/s 4.2 [ 12323899 ]
          Fix Version/s 4.1 [ 12321140 ]
          Robert Muir made changes -
          Fix Version/s 4.3 [ 12324143 ]
          Fix Version/s 5.0 [ 12321663 ]
          Fix Version/s 4.2 [ 12323899 ]
          Gavin made changes -
          Link This issue depends on LUCENE-2845 [ LUCENE-2845 ]
          Gavin made changes -
          Link This issue depends upon LUCENE-2845 [ LUCENE-2845 ]
          Uwe Schindler made changes -
          Fix Version/s 4.4 [ 12324323 ]
          Fix Version/s 4.3 [ 12324143 ]
          Steve Rowe made changes -
          Fix Version/s 5.0 [ 12321663 ]
          Fix Version/s 4.5 [ 12324742 ]
          Fix Version/s 4.4 [ 12324323 ]
          David Smiley made changes -
          Assignee David Smiley [ dsmiley ]
          David Smiley made changes -
          Fix Version/s 4.6 [ 12324999 ]
          Fix Version/s 4.5 [ 12324742 ]
          David Smiley made changes -
          Description Until this patch, the benchmark contrib module had no means of testing geospatial data. This patch adds some new files and changes some existing ones. Here is a per-file summary of what this patch adds (all files below are within the benchmark contrib module), along with my notes:

          Changes:
          * build.xml -- Add dependency on Lucene's spatial module and Solr.
          ** It was a real pain to figure out the convoluted ant build system to make this work, and I doubt I did it the proper way.
          ** Rob Muir thought it would be a good idea to make the benchmark contrib module a top-level module (i.e. alongside analysis) so that it can depend on everything. http://lucene.472066.n3.nabble.com/Re-Geospatial-search-in-Lucene-Solr-tp2157146p2157824.html I agree.
          * ReadTask.java -- Added a search.useHitTotal boolean option that will use the total hits number for reporting purposes, instead of the existing behavior.
          ** The existing behavior (i.e. when search.useHitTotal=false) doesn't look very useful since the response integer is the sum of several things instead of just one thing. I don't see how anyone makes use of it.

          Note that on my local system, I also changed ReportTask & RepSelectByPrefTask to not include the '-' every other line, and changed Format.java to not use commas in the numbers. These changes make copy-pasting into Excel more streamlined.

          New Files:
          * geoname-spatial.alg -- my algorithm file.
          ** Note the ":0" trailing the Populate sequence. This is a trick I use to skip building the index, since it takes a while to build and I'm not interested in benchmarking index construction. You'll want to set this to :1 for the first run and then put it back to :0 for further runs, as long as doc.geo.schemaField and any other configuration elements affecting the index stay the same.
          ** In the patch, doc.geo.schemaField=geohash, but unless you're tinkering with SOLR-2155, you'll probably want to set this to "latlon".
          * GeoNamesContentSource.java -- a ContentSource for a geonames.org data file (either a single country like US.txt or allCountries.txt).
          ** Uses a subclass of DocData to store all the fields. The existing DocData wasn't very applicable to data that is not composed of a title and body.
          ** Doesn't reuse the docdata parameter to getNextDocData(); a new one is created every time.
          ** Only supports content.source.forever=false
          * GeoNamesDocMaker.java -- a subclass of DocMaker that works very differently than the existing DocMaker.
          ** Instead of assuming that each line from geonames.org will correspond to one Lucene document, this implementation supports, via configuration, creating a variable number of documents, each with a variable number of points taken randomly from a GeoNamesContentSource.
          ** doc.geo.docsToGenerate: The number of documents to generate. If blank it defaults to the number of rows in GeoNamesContentSource.
          ** doc.geo.avgPlacesPerDoc: The average number of places to be added to a document. A random number between 0 and one less than twice this amount is chosen on a per document basis. If this is set to 1, then exactly one is always used. In order to support a value greater than 1, use the geohash field type and incorporate SOLR-2155 (geohash prefix technique).
          ** doc.geo.oneDocPerPlace: Whether a place may be used by at most one document. In other words, can more than one document have the same place? If so, set this to false.
          ** doc.geo.schemaField: references a field name in schema.xml. The field should implement SpatialQueryable.
          * GeoPerfData.java: This class is a singleton storing data in memory that is shared by GeoNamesDocMaker.java and GeoQueryMaker.java.
          ** content.geo.zeroPopSubst: if a population is encountered that is <= 0, then use this population value instead. Default is 100.
          ** content.geo.maxPlaces: A limit on the number of rows read in from GeoNamesContentSource.java can be set here. Defaults to Integer.MAX_VALUE.
          ** GeoPerfData is primarily responsible for reading in data from GeoNamesContentSource into memory to store the lat, lon, and population. When a random place is asked for, you get one weighted according to population. The idea is to skew the data towards more referenced places, and a population number is a decent way of doing it.
          * GeoQueryMaker.java -- returns random queries from GeoPerfData by taking a random point and using a particular configured radius. A pure lat-lon bounding box query is ultimately done.
          ** query.geo.radiuskm: The radius of the query in kilometers.
          * schema.xml -- a Solr schema file to configure SpatialQueryable fields referenced by doc.geo.schemaField.
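
The population-weighted place selection (GeoPerfData) and the radius-to-bounding-box step (GeoQueryMaker) described above could look roughly like the following. This is only a minimal sketch, not the code in the patch: the class GeoQuerySketch and all of its method names are hypothetical, the weighting uses a cumulative-sum table with a binary search, and the bounding box uses a simple degrees-per-kilometer approximation that ignores the poles and date-line wrap-around. It also assumes zeroPopSubst is positive.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch; none of these names come from the patch. */
class GeoQuerySketch {
    private final double[] lats;
    private final double[] lons;
    private final long[] cumulativePop; // running total of (substituted) populations
    private final Random random;

    GeoQuerySketch(double[] lats, double[] lons, long[] pops, long zeroPopSubst, Random random) {
        this.lats = lats;
        this.lons = lons;
        this.random = random;
        this.cumulativePop = new long[pops.length];
        long total = 0;
        for (int i = 0; i < pops.length; i++) {
            // Mirrors the idea of content.geo.zeroPopSubst: replace populations <= 0.
            total += pops[i] <= 0 ? zeroPopSubst : pops[i];
            cumulativePop[i] = total;
        }
    }

    /** Returns the index of a place chosen with probability proportional to its population. */
    int randomPlaceIndex() {
        long total = cumulativePop[cumulativePop.length - 1];
        long target = (long) (random.nextDouble() * total); // falls in [0, total)
        return indexFor(target);
    }

    /** Maps a target in [0, total) to the first place whose cumulative population exceeds it. */
    int indexFor(long target) {
        int pos = Arrays.binarySearch(cumulativePop, target + 1);
        return pos >= 0 ? pos : -pos - 1;
    }

    /** Returns {minLat, minLon, maxLat, maxLon} for a box around a circle of radiusKm. */
    static double[] boundingBox(double lat, double lon, double radiusKm) {
        final double kmPerDegree = 111.195; // approximation: 6371 km * pi / 180
        double dLat = radiusKm / kmPerDegree;
        double dLon = dLat / Math.cos(Math.toRadians(lat)); // longitude degrees shrink with latitude
        return new double[] {
            Math.max(-90, lat - dLat), Math.max(-180, lon - dLon),
            Math.min(90, lat + dLat), Math.min(180, lon + dLon)
        };
    }

    public static void main(String[] args) {
        double[] lats = {40.71, 34.05, 44.26};
        double[] lons = {-74.01, -118.24, -72.58};
        long[] pops = {8_000_000L, 3_900_000L, 0L}; // the last entry gets zeroPopSubst
        GeoQuerySketch sketch = new GeoQuerySketch(lats, lons, pops, 100L, new Random(42));
        int i = sketch.randomPlaceIndex();
        double[] box = boundingBox(sketch.lats[i], sketch.lons[i], 350.0);
        System.out.println("place " + i + " -> " + Arrays.toString(box));
    }
}
```

With a cumulative table, each random pick costs one binary search, which stays cheap even when every row of allCountries.txt is loaded into memory.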

          When I run this algorithm as provided with the file in the patch, I get this result:
          {noformat}
          Operation  round  ____km  runCnt  recsPerRun  rec/s         elapsedSec  avgUsedMem   avgTotalMem
          Search_40  0      350     1       4811687     1,206,541.38  3.99        117,722,664  191,934,464
          {noformat}

          The key metrics I use are the average milliseconds per query and the average places per query. The number of queries performed is the trailing numeric suffix of Operation. The formulas:
          * avg ms/query: elapsedSec*1000/queries == 98.8
          * avg places / query: recsPerRun/queries == 120,292
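
As a small illustration, the two formulas can be applied to a report row like this. The helper class GeoBenchMetrics and its method names are hypothetical and not part of the patch. Note that plugging in the rounded elapsedSec of 3.99 from the row above gives roughly 99.75 ms per query, which is close to but not the same as the 98.8 quoted; that figure may come from a different run.

```java
/** Hypothetical helper; names are illustrative and not part of the patch. */
class GeoBenchMetrics {

    /** Average milliseconds per query: elapsedSec * 1000 / queries. */
    static double avgMsPerQuery(double elapsedSec, int queries) {
        return elapsedSec * 1000.0 / queries;
    }

    /** Average places matched per query: recsPerRun / queries. */
    static double avgPlacesPerQuery(long recsPerRun, int queries) {
        return (double) recsPerRun / queries;
    }

    /** Extracts the query count from an operation name such as "Search_40". */
    static int queriesFromOperation(String operation) {
        return Integer.parseInt(operation.substring(operation.lastIndexOf('_') + 1));
    }

    public static void main(String[] args) {
        // Values copied from the Search_40 row of the report above.
        int queries = queriesFromOperation("Search_40");
        System.out.printf("avg ms/query: %.2f%n", avgMsPerQuery(3.99, queries));
        System.out.printf("avg places/query: %.0f%n", avgPlacesPerQuery(4811687L, queries));
    }
}
```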
          See comments for details.
          In particular, the original patch "benchmark-geo.patch" is fairly different from LUCENE-2844.patch.
          David Smiley made changes -
          Attachment LUCENE-2844_spatial_benchmark.patch [ 12605104 ]
          David Smiley made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          David Smiley made changes -
          Component/s modules/spatial [ 12312623 ]
          David Smiley made changes -
          Attachment LUCENE-2844_spatial_benchmark.patch [ 12609887 ]
          David Smiley made changes -
          Status In Progress [ 3 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              1
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:

                Development