Lucene - Core · LUCENE-5609

Should we revisit the default numeric precision step?

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Right now it's 4, for both 8-byte (long/double) and 4-byte (int/float)
      numeric fields, but this is a pretty big hit on indexing speed and
      disk usage, especially for tiny documents, because it creates many (8
      or 16) terms for each value.

      Since we originally set these defaults, a lot has changed... e.g. we
      now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
      a faster postings format, etc.

      Index size is important because it limits how much of the index will
      be hot (fit in the OS's IO cache). And more apps are using Lucene for
      tiny docs where the overhead of individual fields is sizable.
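      As a point of reference, the number of indexed terms per value is just
      ceil(bits / precisionStep), so precisionStep=16 means 4 terms per long.
      A minimal sketch (field name and values here are only illustrative) of
      overriding the default at index time and using the same step at query time:

      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.FieldType;
      import org.apache.lucene.document.LongField;
      import org.apache.lucene.search.NumericRangeQuery;
      import org.apache.lucene.search.Query;

      // Index time: copy LongField's default type but use precisionStep=16
      // (64 bits / 16 = 4 trie terms per indexed value).
      FieldType longType = new FieldType(LongField.TYPE_NOT_STORED);
      longType.setNumericPrecisionStep(16);
      longType.freeze();

      Document doc = new Document();
      doc.add(new LongField("modified", 1398374400000L, longType));

      // Query time: pass the same precisionStep (or a multiple of it) to
      // NumericRangeQuery so it walks trie terms that actually exist.
      Query q = NumericRangeQuery.newLongRange("modified", 16,
          1398000000000L, 1399000000000L, true, true);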

      I used the Geonames corpus to run a simple benchmark (all sources are
      committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
      with these numeric fields:

      • lat/lng (double)
      • modified time, elevation, population (long)
      • dem (int)

      I tested 4, 8 and 16 precision steps:

      indexing:
      
      PrecStep        Size        IndexTime
             4   1812.7 MB        651.4 sec
             8   1203.0 MB        443.2 sec
            16    894.3 MB        361.6 sec
      
      
      searching:
      
           Field  PrecStep   QueryTime   TermCount
       geoNameID         4   2872.5 ms       20306
       geoNameID         8   2903.3 ms      104856
       geoNameID        16   3371.9 ms     5871427
        latitude         4   2160.1 ms       36805
        latitude         8   2249.0 ms      240655
        latitude        16   2725.9 ms     4649273
        modified         4   2038.3 ms       13311
        modified         8   2029.6 ms       58344
        modified        16   2060.5 ms       77763
       longitude         4   3468.5 ms       33818
       longitude         8   3629.9 ms      214863
       longitude        16   4060.9 ms     4532032
      

      Index time is with 1 thread (for identical index structure).

      The query time is time to run 100 random ranges for that field,
      averaged over 20 iterations. TermCount is the total number of terms
      the MTQ rewrote to across all 100 queries / segments, and it gets
      higher as expected as precStep gets higher, but the search time is not
      that heavily impacted ... negligible going from 4 to 8, and then some
      impact from 8 to 16.

      Maybe we should increase the int/float default precision step to 8 and
      long/double to 16? Or both to 16?

        Activity

        Michael McCandless added a comment -

        Another test, this time on a biggish (~1B docs) simulated timestamps
        over a 1 day range with msec precision (benchmark sources are in
        luceneutil):

        indexing:
        
        PrecStep        Size    IndexTime
               4      6.7 GB     2482 sec
               8      4.0 GB     2229 sec
              16      2.8 GB     2213 sec
        

        Curiously index speed did not change much (this test only indexes the
        one LongField), but I did re-use the LongField instance across
        addDocument calls, and I used 12 indexing threads.
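        The reuse pattern is roughly the following sketch; the field and
        writer names are only placeholders here, not the actual benchmark code:

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.LongField;

        // Create the Document and LongField once, then mutate the value per
        // document instead of allocating a new field (and its TokenStream) each time.
        Document doc = new Document();
        LongField timestamp = new LongField("timestamp", 0L, Field.Store.NO);
        doc.add(timestamp);
        for (long millis : simulatedTimestamps) {   // simulatedTimestamps: placeholder for the values
          timestamp.setLongValue(millis);
          writer.addDocument(doc);                  // writer: an open IndexWriter
        }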

        searching:
        
        PrecStep     QueryTime         TermCount
               4      29.7 sec        1285 terms
               8      29.9 sec       11216 terms
              16      30.6 sec     1410453 terms
        

        The query slowdown as precStep increases is much smaller here ... I
        suspect because the values are so "dense" and many docs share each
        timestamp, but there is still a big reduction in index size.

        Net/net I think we should increase the default precStep?

        Robert Muir added a comment -

        +1, I think the 4.0 MTQ rework made a lot of difference here.

        One day we should follow up with a re-examination of AUTOREWRITE too, but this one is much more important because of index size, etc.

        Paul Elschot added a comment -

        If the current implementation can only handle precision steps that are powers of 2, going to a default of 8 looks good to me.

        Anyway, these results make me curious about a precision step of 6 as the search-speed default, and 11 as the indexing-speed default.

        Robert Muir added a comment -

        To keep it simple, I think any change we make has to be a multiple of 4, the current default.

        This way we have easy backwards compat with existing indexes for users that just go with the default (since NumericRangeQuery has ctors that use the default precisionStep, and multiples of the index-time precisionStep work at query time).

        Otherwise, it could cause a lot of confusion.
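
        As a rough sketch of what that back-compat means in practice (the field
        name is made up): an index written with the old default of 4 can still
        be queried with 8, because every shift an 8-step query needs was
        already indexed by the 4-step field.

        import org.apache.lucene.search.NumericRangeQuery;
        import org.apache.lucene.search.Query;

        // Field was indexed with the old default precisionStep=4; querying with
        // precisionStep=8 (a multiple of 4) still matches, just via coarser terms.
        Query q = NumericRangeQuery.newIntRange("population", 8,
            10000, 1000000, true, true);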

        Paul Elschot added a comment -

        Backward compat is indeed a strong point. It's also nice to return to 8.

        Uwe Schindler added a comment -

        I would use precStep 8 for ints (Solr already does this by default). As we need a multiple of 4: Mike, can you check the number of terms and index size for precStep=12? 16 is way too big in my opinion and in my tests in the past.
        The overhead is not as big as you might think. The problem is if you have an index solely of numerics. For a real comparison, you should use something like Wikipedia, maybe add something like the last-modified date as a long field, and then test.
        Also, we have lots of queries with up to 8 different numeric fields in parallel (half-open ranges). For those there is still a huge improvement with lower precision steps. I found that 8 is best; 16 hurts very much if you query multiple numeric fields ANDed/ORed together.
        Also, not everybody has the index completely in memory! If you have a pure in-memory index, you could theoretically disable the trie terms completely. The numeric fields are made for indexes with lots of disk/SSD IO, where you have many numeric fields combined with simple full-text queries and some facets.

        So please also check complex queries on really large indexes, not just simple range filters on small indexes with solely numeric fields.
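
        A minimal sketch of that query shape (field names are just examples): a
        document interval [validFrom, validTo] overlaps the query interval
        [qStart, qEnd] when validFrom <= qEnd and validTo >= qStart, i.e. two
        half-open NumericRangeQueries ANDed together:

        import org.apache.lucene.search.BooleanClause;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.NumericRangeQuery;

        Long qStart = 1398297600000L, qEnd = 1398384000000L;  // example query interval (msec)

        BooleanQuery overlaps = new BooleanQuery();
        overlaps.add(NumericRangeQuery.newLongRange("validFrom", null, qEnd, true, true),
            BooleanClause.Occur.MUST);   // validFrom <= qEnd (half-open)
        overlaps.add(NumericRangeQuery.newLongRange("validTo", qStart, null, true, true),
            BooleanClause.Occur.MUST);   // validTo >= qStart (half-open)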

        Uwe Schindler added a comment -

        To explain why you might have multiple numeric fields and multiple range queries: I have a customer with date ranges (they use precStep 8; 16 hurt a lot with ElasticSearch on a 100 GB index), and we also use this for geo search here in-house (PANGAEA). If you have something like overlapping ranges, you need 2 queries with half-open ranges. For example, each document carries a date range (start/end date of validity), the query is also a date range, and you want to find all documents whose validity range overlaps the query range. In that case you need 2 half-open range queries, which are expensive with large precision steps. For stuff like bounding boxes in geo, you might need to check whether the bounding box of the document overlaps the bounding box of the query (a Google-Maps-like query). Here you have 4 half-open ranges, each of which almost always hits half of all your documents. With large precision steps this takes very long. So 8 is a good default; for my customer, 16 took about 4 times as long as 8 (because of the half-open ranges). With smaller precision steps, half-open ranges are cheap.

        With Geonames you can check this: Geonames entries mostly have bounding boxes assigned, and you want to search with bounding boxes, too. This is my example above. Those ranges (unless you only want documents completely inside the query range) are always 4 half-open ones, each hitting about half of all documents. By ANDing them together, you later get the real results (ConjunctionScorer).

        Robert Muir added a comment -

        I see your point, Uwe, but we should remember that these are the defaults for all numeric fields. In other words, the field is named IntField and LongField and so on.

        This is a very general thing to the user, like a primitive data type. In fact the user may not use ranges at all, let alone complex, intensive geospatial half-open ones. They might just have a numeric field for some identifier, or a simple count, or whatever.

        So I feel the default precisionStep should reflect this: it should make the right trade-offs of index time and space against range-query performance, keeping in mind that it's just a general numeric type and the user may not even be interested in ranges at all.

        Robert Muir added a comment -

        OK, here is an idea for a compromise:

        This patch sets a default of 16 for 64-bit types and 8 for 32-bit types. It's pretty simple, because there is type safety everywhere.

        Michael McCandless added a comment -

        +1 for 8/16.

        Paul Elschot added a comment -

        Going from 4 to 16 for the 64-bit types is a very large step.
        Wouldn't it be better to do that in stages and only go from 4 to 8 now?

        I think 11 is better than 12.
        Both have an indexing cost of 3 indexed terms for 32 bits (10/11/11 and 8/12/12 precision bits per term).
        11 should be faster at searching because it involves fewer terms. For a single-ended range, the expected number of terms in these cases is about half of:

         (2**10 + 2**11 + 2**11) = 5120  <  8448 = (2**8 + 2**12 + 2**12)

        Whether that difference is actually noticeable remains to be seen.

        Independent of the precision step, geohashes from the spatial module might help to avoid range subqueries that have large results.

        Robert Muir added a comment -

        It's not a large step; it's that 4 was ridiculously small before.

        Paul Elschot added a comment -

        Have a look at LUCENE-1470, even 2 was considered then.

        Robert Muir added a comment -

        The old discussions and benchmarks are irrelevant: the execution of multi-term queries and the index encoding have changed substantially since then. That's the point of changing the defaults to reflect reality: we need not quadruple users' indexes needlessly anymore.

        Uwe Schindler added a comment (edited) -

        Have a look at LUCENE-1470, even 2 was considered then.

        That was not really usable even at that time! The improvement compared to 4 was zero; it was even worse, because the term dictionary got larger, which had an impact in 2.x and 3.x. At that time I was always using 8 as the precisionStep for longs and ints. The same applied to Solr; Lucene was the only one using 4 as the default, and ElasticSearch was cloning Lucene's defaults.

        I would really prefer to use 8 for both ints and longs. Going from 8 to 16 increases the number of terms visited per range query immensely, while the index-size difference between 8 and 16 is not really a problem. My tests have also shown that, because of the way floats/doubles are encoded, a precision step of 8 is really good for longs: in most cases the high bits (like the exponent) never change, so there is exactly one indexed term for them.

        With a precision step of 16, I would imagine the difference between 16 and 64 would be negligible, too. The main reason for having lower precision steps is indexes where the values are equally distributed. For values clustered around a few numbers, the precision step is irrelevant: because of the way it works, for larger shifts the indexed value is constant, so you have one or two terms that hit all documents and are never used by the range query.

        So before changing the default, I would suggest a test with an index that has equally distributed numbers over the full 64-bit range.

        I think 11 is better than 12

        ...because the last term is better used. The number of indexed terms is the same for 11 and 12 (6*11=66 and 6*12=72 both cover 64 bits, but 5*12=60 is too small). Unfortunately 11 is not a multiple of 4, so it would not be backwards compatible.

        I think the main problem in this issue is that we only have one default. Somebody never doing any ranges does not need the additional terms at all. That's the main problem. Solr is better here, as it provides 2 predefined field types, but Lucene only has one - and that is the bug.

        So my proposal: provide a 2nd field type as a 2nd default, with proper documentation, suggesting it to users who only want to index numeric identifiers or non-DocValues fields they want to sort on.

        And second, we should do LUCENE-5605 - I started on it last week but was interrupted by NativeFSIndexCorrumpter. The problem is the precisionStep altogether! We should make it an implementation detail: when constructing an NRQ, you should not need to pass it. Because of this I opened LUCENE-5605, so anybody creating an NRQ/NRF should pass the FieldType to the NRQ ctor, not an arbitrary number. Then it is ensured that people use the same settings for indexing and querying.

        Together with this, we should provide 2 predefined field types per data type and remove the constant from NumericUtils completely. The 2 field types per data type might be named something like DEFAULT_INT_FOR_RANGEQUERY_FIELDTYPE and DEFAULT_INT_OTHERWISE_FIELDTYPE (please choose better names and javadocs). And we should make 8 the new default, which is fully backwards compatible. And hide the precision step completely! 16 is really too large for lots of queries, and the difference in index size is negligible unless you have a purely numeric index (in which case you should use an RDBMS instead of a Lucene index to query your data!). Indexing time is also, as Mike discovered, not a problem at all: if people don't reuse the IntField instance, it is always equally slow, because the TokenStream has to be recreated for every number. The number of terms is not the issue at all, sorry!

        About ElasticSearch: unfortunately the schemaless mode of ElasticSearch always uses 4 as the precStep if it detects a numeric or date type. ES should change this, but maybe with a bit more intelligent "guessing". E.g., if you index the "_id" field as an integer, it should automatically use an infinite (DEFAULT_INT_OTHERWISE_FIELDTYPE) precStep - nobody does range queries on the "_id" field. For all standard numeric fields it should use precStep=8.
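
        A rough sketch of how those two predefined types could look (the
        constant names are just the placeholders from above, not an existing
        Lucene API):

        import org.apache.lucene.document.FieldType;
        import org.apache.lucene.document.IntField;

        // Range-friendly type: extra trie terms for fast NumericRangeQuery.
        FieldType DEFAULT_INT_FOR_RANGEQUERY_FIELDTYPE = new FieldType(IntField.TYPE_NOT_STORED);
        DEFAULT_INT_FOR_RANGEQUERY_FIELDTYPE.setNumericPrecisionStep(8);
        DEFAULT_INT_FOR_RANGEQUERY_FIELDTYPE.freeze();

        // Identifier/sort-only type: Integer.MAX_VALUE disables the trie terms,
        // so exactly one term is indexed per value.
        FieldType DEFAULT_INT_OTHERWISE_FIELDTYPE = new FieldType(IntField.TYPE_NOT_STORED);
        DEFAULT_INT_OTHERWISE_FIELDTYPE.setNumericPrecisionStep(Integer.MAX_VALUE);
        DEFAULT_INT_OTHERWISE_FIELDTYPE.freeze();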

        Robert Muir added a comment -

        I think the main problem in this issue is that we only have one default. Somebody never doing any ranges does not need the additional terms at all. That's the main problem. Solr is better here, as it provides 2 predefined field types, but Lucene only has one - and that is the bug.

        Well, I kind of agree, but in a different way.

        In my opinion the default numeric types (IntField, LongField, FloatField, DoubleField) should have good defaults for general-purpose use. This includes range queries: they should work "reasonably" well out of the box. Users that don't need range queries can optimize by changing the precisionStep to infinity. Along the same lines, they also don't need to be super-optimized for "hardcore", esoteric uses of range queries. That's what defaults are: just making the right trade-offs for out-of-box use.

        I would not be happy if these fields defaulted to precisionStep=infinity either, because that's also a bad default for general-purpose use, just in the opposite direction from precisionStep=4.

        I am fine with precisionStep=8 as the new default for both, but I don't think it's the best idea. I think 16 for the 64-bit types is nice because it's easy to understand: 4 terms for each value. Today it's 8 terms per value for a 32-bit field and 16 terms per value for a 64-bit field.

        I also think we should be able to add new types in the future (e.g. 16-bit short and half-float) and give them different defaults too. So, I don't understand the need for a "one-size-fits-all" default.

        Michael McCandless added a comment -

        I think 8/16 (4 terms for int/float and also 4 terms for long/double) is a better general purpose default than 8/8.

        I think testing on randomly distributed longs is too synthetic? Most real-world data is much more restricted in practice, and those exceptional cases can re-tune precisionStep to meet their needs.

        Indexing time is also, as Mike discovered, not a problem at all: if people don't reuse the IntField instance, it is always equally slow, because the TokenStream has to be recreated for every number. The number of terms is not the issue at all, sorry!

        Really, apps should not have to re-use Field instances to get good indexing performance. In LUCENE-5611 I saw big gains by "specializing" how untokenized and numeric fields are indexed, and I think we should somehow do this (separately).

        But the number of terms is a problem: it increases indexing time and index size, and doesn't buy much of a speedup for searching.

        David Smiley added a comment -

        I think testing on randomly distributed longs is too synthetic? Most real-world data is much more restricted in practice, and those exceptional cases can re-tune precisionStep to meet their needs.

        Agreed – real-world data is definitely much more restricted in practice.

        I wish the precisionStep were variable. If it were, I'd usually configure the steps to be 16,8,8,8,8,16 for doubles & longs. Variable prefix-tree precision is definitely a goal of LUCENE-4922 in the spatial module. At the very top level it's extremely rare to do gigantic continent-spanning queries, so at that level I'd like many cells (corresponding to a high precision step in trie numeric fields). And at the bottom levels it's fastest to scan() instead of seek(), because there is a limited amount of data once you get down low enough, so preferably fewer intermediate aggregate cells down there.

        Michael McCandless added a comment -

        I think we should do something here for 4.9; poor defaults just hurt our users.

        I'd like to do 8/16, but Uwe, are you completely against this?

        Michael McCandless added a comment -

        I plan to commit 8/16 soon ...

        ASF subversion and git services added a comment -

        Commit 1592485 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1592485 ]

        LUCENE-5609: increase default NumericField precStep

        ASF subversion and git services added a comment -

        Commit 1592521 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1592521 ]

        LUCENE-5609: increase default NumericField precStep


          People

          • Assignee:
            Unassigned
          • Reporter:
            Michael McCandless
          • Votes:
            0
          • Watchers:
            7
