Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Right now the default precisionStep is 4, for both 8-byte (long/double)
and 4-byte (int/float) numeric fields, but this is a pretty big hit on
indexing speed and disk usage, especially for tiny documents, because it
creates many terms for each value (16 for long/double, 8 for int/float).
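As a back-of-the-envelope check (termsPerValue is a hypothetical helper for the arithmetic, not a Lucene API), the term count per value is just the value's bit width divided by precisionStep, rounded up, since the trie encoding indexes one term per precisionStep-sized shift:

    // Hypothetical helper, just for the arithmetic (not a Lucene API).
    static int termsPerValue(int valueSizeBits, int precisionStep) {
        return (valueSizeBits + precisionStep - 1) / precisionStep; // ceil division
    }
    // termsPerValue(64, 4)  -> 16 terms per long/double value today
    // termsPerValue(32, 4)  ->  8 terms per int/float value today
    // termsPerValue(64, 16) ->  4;  termsPerValue(32, 8) -> 4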
Since we originally set these defaults, a lot has changed: e.g., we now
rewrite multi-term queries (MTQs) per segment, we have a faster terms
dictionary (BlockTree), a faster postings format, etc.
Index size is important because it limits how much of the index can stay
hot (fit in the OS's I/O cache). And more apps are using Lucene for
tiny docs, where the per-field overhead is sizable.
I used the Geonames corpus to run a simple benchmark (all sources are
committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
including these numeric fields:
- lat/lng (double)
- modified time, elevation, population (long)
- dem (int)
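For reference, a minimal sketch of how an explicit precisionStep is set on such fields, assuming Lucene 4.x's classic numeric field API (the helper and the subset of fields are illustrative, not the actual luceneutil code):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.DoubleField;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.LongField;

    // Hypothetical helper (not from luceneutil): build one tiny doc with an
    // explicit precisionStep on its numeric fields.
    static Document makeDoc(double lat, double lng, long modified, int precStep) {
        FieldType doubleType = new FieldType(DoubleField.TYPE_NOT_STORED);
        doubleType.setNumericPrecisionStep(precStep);
        doubleType.freeze();
        FieldType longType = new FieldType(LongField.TYPE_NOT_STORED);
        longType.setNumericPrecisionStep(precStep);
        longType.freeze();

        Document doc = new Document();
        doc.add(new DoubleField("latitude", lat, doubleType));
        doc.add(new DoubleField("longitude", lng, doubleType));
        doc.add(new LongField("modified", modified, longType));
        return doc;
    }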
I tested 4, 8 and 16 precision steps:
indexing:

PrecStep  Size       IndexTime
4         1812.7 MB  651.4 sec
8         1203.0 MB  443.2 sec
16         894.3 MB  361.6 sec

searching:

Field      PrecStep  QueryTime  TermCount
geoNameID  4         2872.5 ms  20306
geoNameID  8         2903.3 ms  104856
geoNameID  16        3371.9 ms  5871427
latitude   4         2160.1 ms  36805
latitude   8         2249.0 ms  240655
latitude   16        2725.9 ms  4649273
modified   4         2038.3 ms  13311
modified   8         2029.6 ms  58344
modified   16        2060.5 ms  77763
longitude  4         3468.5 ms  33818
longitude  8         3629.9 ms  214863
longitude  16        4060.9 ms  4532032
Index time is with a single thread (so the index structure is identical
across runs).
Query time is the time to run 100 random ranges for that field, averaged
over 20 iterations. TermCount is the total number of terms the MTQs
rewrote to, summed across all 100 queries and all segments; as expected
it grows as precStep grows, but search time is not that heavily
impacted: negligible going from 4 to 8, and some impact from 8 to 16.
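A rough sketch of what one such timed run looks like, assuming Lucene 4.x's NumericRangeQuery (the index path, seed, and bounds are made up; this is not the actual luceneutil benchmark code):

    import java.io.File;
    import java.util.Random;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.TotalHitCountCollector;
    import org.apache.lucene.store.FSDirectory;

    // Sketch: time 100 random latitude ranges; error handling omitted.
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    Random random = new Random(17);
    int precStep = 8; // must match the step the field was indexed with
    long t0 = System.nanoTime();
    for (int i = 0; i < 100; i++) {
        double a = -90 + 180 * random.nextDouble();
        double b = -90 + 180 * random.nextDouble();
        NumericRangeQuery<Double> q = NumericRangeQuery.newDoubleRange(
            "latitude", precStep, Math.min(a, b), Math.max(a, b), true, true);
        TotalHitCountCollector collector = new TotalHitCountCollector();
        searcher.search(q, collector);
    }
    System.out.println("100 ranges: " + (System.nanoTime() - t0) / 1000000.0 + " ms");
    reader.close();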
Maybe we should increase the default precisionStep to 8 for int/float
and 16 for long/double? Or to 16 for both?
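For concreteness, under the same assumed Lucene 4.x API as above, those proposed defaults are what an app effectively opts into today by setting them explicitly:

    FieldType intType = new FieldType(IntField.TYPE_NOT_STORED);
    intType.setNumericPrecisionStep(8);    // int/float: 32/8  = 4 terms per value (down from 8)
    FieldType longType = new FieldType(LongField.TYPE_NOT_STORED);
    longType.setNumericPrecisionStep(16);  // long/double: 64/16 = 4 terms per value (down from 16)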