  Lucene - Core / LUCENE-5609

Should we revisit the default numeric precision step?

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New

    Description

      Right now it's 4, for both 8-byte (long/double) and 4-byte (int/float)
      numeric fields, but this is a pretty big hit on indexing speed and
      disk usage, especially for tiny documents, because it creates many
      terms for each value (8 for a 4-byte value, 16 for an 8-byte value).
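
      To make the "8 or 16 terms" figure concrete, here is a minimal sketch
      in Python (not part of luceneutil; the function name is made up for
      illustration). It assumes one indexed term per shift 0, precStep,
      2*precStep, ... below the value's bit width, i.e.
      ceil(bits / precStep) terms per value:

```python
def terms_per_value(bits_per_value: int, precision_step: int) -> int:
    """Number of terms indexed for one numeric value.

    Assumption: one term is emitted for each shift 0, step, 2*step, ...
    that is still below the bit width, which is ceil(bits / step).
    """
    if bits_per_value < 1 or precision_step < 1:
        raise ValueError("bits_per_value and precision_step must be >= 1")
    # Integer ceiling division, avoids floating point.
    return (bits_per_value + precision_step - 1) // precision_step


if __name__ == "__main__":
    for step in (4, 8, 16):
        print(
            f"precStep {step:>2}: "
            f"int/float -> {terms_per_value(32, step)} terms, "
            f"long/double -> {terms_per_value(64, step)} terms"
        )
```

      Under that assumption, precStep 8 would cut a long/double value from
      16 terms to 8, and precStep 16 would cut it to 4.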

      Since we originally set these defaults, a lot has changed: e.g. we
      now rewrite MTQs (MultiTermQuery) per-segment, we have a faster
      (BlockTree) terms dict, a faster postings format, etc.

      Index size is important because it limits how much of the index will
      be hot (fit in the OS's IO cache). And more apps are using Lucene for
      tiny docs where the overhead of individual fields is sizable.

      I used the Geonames corpus to run a simple benchmark (all sources are
      committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
      with these numeric fields:

      • lat/lng (double)
      • modified time, elevation, population (long)
      • dem (int)

      I tested precision steps of 4, 8 and 16:

      indexing:
      
      PrecStep        Size        IndexTime
             4   1812.7 MB        651.4 sec
             8   1203.0 MB        443.2 sec
            16    894.3 MB        361.6 sec
      
      
      searching:
      
           Field  PrecStep   QueryTime   TermCount
       geoNameID         4   2872.5 ms       20306
       geoNameID         8   2903.3 ms      104856
       geoNameID        16   3371.9 ms     5871427
        latitude         4   2160.1 ms       36805
        latitude         8   2249.0 ms      240655
        latitude        16   2725.9 ms     4649273
        modified         4   2038.3 ms       13311
        modified         8   2029.6 ms       58344
        modified        16   2060.5 ms       77763
       longitude         4   3468.5 ms       33818
       longitude         8   3629.9 ms      214863
       longitude        16   4060.9 ms     4532032
      

      Index time is with 1 thread (so the index structure is identical
      across runs).
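
      As a rough way to read the indexing table, the sketch below (Python,
      illustrative only; the helper names are not from luceneutil) computes
      the change in index size and index time relative to the current
      default of precStep 4, using the numbers copied from the table above:

```python
# (precStep, index size in MB, index time in sec), copied from the indexing table.
INDEXING = [
    (4, 1812.7, 651.4),
    (8, 1203.0, 443.2),
    (16, 894.3, 361.6),
]


def pct_change(baseline: float, value: float) -> float:
    """Relative change of value versus baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


def indexing_savings(rows):
    """Map precStep -> (size change %, time change %) versus the first row."""
    _, base_size, base_time = rows[0]
    return {
        step: (pct_change(base_size, size), pct_change(base_time, secs))
        for step, size, secs in rows[1:]
    }


if __name__ == "__main__":
    for step, (size_pct, time_pct) in indexing_savings(INDEXING).items():
        print(f"precStep {step:>2}: size {size_pct:+.1f}%, index time {time_pct:+.1f}%")
```

      On these numbers, precStep 8 is about a third smaller and faster to
      index than precStep 4, and precStep 16 roughly halves the index size.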

      Query time is the time to run 100 random ranges for that field,
      averaged over 20 iterations. TermCount is the total number of terms
      the MTQ rewrote to, summed across all 100 queries and all segments.
      It grows with precStep, as expected, but search time is not heavily
      affected: the change is negligible going from 4 to 8, with some
      impact going from 8 to 16.
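
      To put numbers on "negligible" and "some impact", this sketch (Python,
      illustrative only; the names are made up here) computes the relative
      change in query time per field between adjacent precision steps, using
      the values copied from the searching table:

```python
# Query time in ms per field and precStep, copied from the searching table.
QUERY_MS = {
    "geoNameID": {4: 2872.5, 8: 2903.3, 16: 3371.9},
    "latitude": {4: 2160.1, 8: 2249.0, 16: 2725.9},
    "modified": {4: 2038.3, 8: 2029.6, 16: 2060.5},
    "longitude": {4: 3468.5, 8: 3629.9, 16: 4060.9},
}


def slowdown_pct(times: dict, from_step: int, to_step: int) -> float:
    """Relative change in query time between two precision steps, in percent."""
    return 100.0 * (times[to_step] - times[from_step]) / times[from_step]


if __name__ == "__main__":
    for field, times in QUERY_MS.items():
        print(
            f"{field:>10}: "
            f"4->8 {slowdown_pct(times, 4, 8):+.1f}%, "
            f"8->16 {slowdown_pct(times, 8, 16):+.1f}%"
        )
```

      On these numbers every field stays under a 5% slowdown going from 4
      to 8 (modified is marginally faster), while going from 8 to 16 costs
      up to about 21% (latitude).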

      Maybe we should increase the int/float default precision step to 8 and
      long/double to 16? Or both to 16?

      Attachments

        1. LUCENE-5609.patch
          12 kB
          Robert Muir

          People

            Assignee: Unassigned
            Reporter: Michael McCandless (mikemccand)
            Votes: 0
            Watchers: 7
