Lucene - Core
  1. Lucene - Core
  2. LUCENE-1712

Set default precisionStep for NumericField and NumericRangeFilter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is a spinoff from LUCENE-1701.

      A user using Numeric* should not need to understand what's
      "under the hood" in order to do their indexing & searching.

      They should be able to simply:

      doc.add(new NumericField("price", 15.50);
      

      And have a decent default precisionStep selected for them.

      Actually, if we add ctors to NumericField for each of the supported
      types (so the above code works), we can set the default per-type. I
      think we should do that?

      4 for int and 6 for long was proposed as good defaults.

      The default need not be "perfect", as advanced users can always
      optimize their precisionStep, and for users experiencing slow
      RangeQuery performance, NumericRangeQuery with any of the defaults we
      are discussing will be much faster.

      1. LUCENE-1712.patch
        41 kB
        Uwe Schindler
      2. LUCENE-1712.patch
        29 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Should be:

        doc.add(new NumericField("price").setFloatValue(15.50f));
        

        With a direct constructor, there would be the problem of missing data type information.

        Show
        Uwe Schindler added a comment - Should be: doc.add( new NumericField( "price" ).setFloatValue(15.50f)); With a direct constructor, there would be the problem of missing data type information.
        Hide
        Michael McCandless added a comment -

        With a direct constructor, there would be the problem of missing data type information.

        Sorry, what does that mean?

        Show
        Michael McCandless added a comment - With a direct constructor, there would be the problem of missing data type information. Sorry, what does that mean?
        Hide
        Uwe Schindler added a comment -

        The methods setFloatValue() etc are for specifying the exact data type. If you would do it in the constructor, the resulting code would be very error-prone. e.g., new NumericField("price", 12.5): Does this mean 12.5 as double or float, is 15 alone meant as (in future) short, byte or int? OK users can add "f" in the float case to it, but this makes it very hard to prevent errors, because Java automatically casts all numeric types to each other suddenly. This is why I added these factories for NumericRangeQuery and the setters here.

        Show
        Uwe Schindler added a comment - The methods setFloatValue() etc are for specifying the exact data type. If you would do it in the constructor, the resulting code would be very error-prone. e.g., new NumericField("price", 12.5): Does this mean 12.5 as double or float, is 15 alone meant as (in future) short, byte or int? OK users can add "f" in the float case to it, but this makes it very hard to prevent errors, because Java automatically casts all numeric types to each other suddenly. This is why I added these factories for NumericRangeQuery and the setters here.
        Hide
        Michael McCandless added a comment -

        OK, I agree it's dangerous to let javac auto cast.

        So, can we set an across the board default of 4?

        Show
        Michael McCandless added a comment - OK, I agree it's dangerous to let javac auto cast. So, can we set an across the board default of 4?
        Hide
        Earwin Burrfoot added a comment -

        Am I misunderstanding something or the problem still persists?
        Even if you use a common default, what is your base type - int or long? Are floats converted to ints, or to longs?

        Show
        Earwin Burrfoot added a comment - Am I misunderstanding something or the problem still persists? Even if you use a common default, what is your base type - int or long? Are floats converted to ints, or to longs?
        Hide
        Uwe Schindler added a comment -

        I think this is simple, we just overload all ctors and factories and remove the precisionStep there. These methods use NumericUtils.DEFAULT_PRECISION_STEP = 4 then. But thee should be a clear not, to do also tests with other step values.

        additional variants:

        • NumericField(name), NumericField(name,store,index)
        • NumericTokenStream()
        • NumericRangeQuery.newXxxRange(field, min, max,...)

        I am currentlly not sure (I was thinking the whole time during including into core) to also make NumericRangeQuery work like the other two classes: generic Constructor without datatype and then set the range explicit:

        new NumericRangeQuery(fieldname[, precisionStep]).setFloatRange(....)
        

        Not sure, in this case the API where similar and I have to override only one ctor for different construction parameters. My only problem is, that Queries are normally almost everywhere in Lucene static and unmodifable (beyond boost).

        Show
        Uwe Schindler added a comment - I think this is simple, we just overload all ctors and factories and remove the precisionStep there. These methods use NumericUtils.DEFAULT_PRECISION_STEP = 4 then. But thee should be a clear not, to do also tests with other step values. additional variants: NumericField(name), NumericField(name,store,index) NumericTokenStream() NumericRangeQuery.newXxxRange(field, min, max,...) I am currentlly not sure (I was thinking the whole time during including into core) to also make NumericRangeQuery work like the other two classes: generic Constructor without datatype and then set the range explicit: new NumericRangeQuery(fieldname[, precisionStep]).setFloatRange(....) Not sure, in this case the API where similar and I have to override only one ctor for different construction parameters. My only problem is, that Queries are normally almost everywhere in Lucene static and unmodifable (beyond boost).
        Hide
        Uwe Schindler added a comment -

        Even if you use a common default, what is your base type - int or long? Are floats converted to ints, or to longs?

        Float are indexed like ints and doubles like longs.

        The problem here is more that if you would specify the value direct in the constructor, you cannot for sure always give the right type (because Java auto-casts). This is why I have these setFloatValue(), setLongValue() and so on.

        Show
        Uwe Schindler added a comment - Even if you use a common default, what is your base type - int or long? Are floats converted to ints, or to longs? Float are indexed like ints and doubles like longs. The problem here is more that if you would specify the value direct in the constructor, you cannot for sure always give the right type (because Java auto-casts). This is why I have these setFloatValue(), setLongValue() and so on.
        Hide
        Earwin Burrfoot added a comment -

        Aha! And each time you invoke setFloatValue/setDoubleValue it switches base type behind the scenes? Eeek.

        Show
        Earwin Burrfoot added a comment - Aha! And each time you invoke setFloatValue/setDoubleValue it switches base type behind the scenes? Eeek.
        Hide
        Michael McCandless added a comment -

        each time you invoke setFloatValue/setDoubleValue it switches base type behind the scenes?

        I think that's an acceptable risk. I suppose we could add checking to catch you but I don't think that's needed (we should document clearly that you can't "mix types").

        Show
        Michael McCandless added a comment - each time you invoke setFloatValue/setDoubleValue it switches base type behind the scenes? I think that's an acceptable risk. I suppose we could add checking to catch you but I don't think that's needed (we should document clearly that you can't "mix types").
        Hide
        Hoss Man added a comment -

        behind the scenes precision changes based on which set*Value() method is called smells really wrong.

        I'm not overly familiar with NumericField, but i'm i'm understanding the current suggestion, wouldn't that mkae situations like this come up...

        NumericField a = new NumericField("price", MY_CUSTOM_PRECISION_STEP, ...);
        a.setFloatValue(23.4f); // blows away my custom precision
        
        NumericField b = new NumericField("price", ...);
        b.setPrecisionStep(MY_CUSTOM_PRECISION_STEP);
        b.setFloatValue(23.4f); // blows away my custom precision
        
        NumericField c = new NumericField("price", ...);
        c.setFloatValue(23.4f); 
        c.setPrecisionStep(MY_CUSTOM_PRECISION_STEP); // only way to get my value used
        

        ...that seems sketchy, and really anoying if people try reusing NumericField instances.

        If the goal is to have good "defaults" based on type then why not just have a constant per type that people can refer to explicitly? if they don't know what number to pick ... as well as a true "default" if they pick nothing.

        int DEFAULT_STEP = ...;
        int SUGGESTED_INT_STEP = ...;
        int SUGGESTED_FLOAT_STEP = ...;
        
        Show
        Hoss Man added a comment - behind the scenes precision changes based on which set*Value() method is called smells really wrong. I'm not overly familiar with NumericField, but i'm i'm understanding the current suggestion, wouldn't that mkae situations like this come up... NumericField a = new NumericField( "price" , MY_CUSTOM_PRECISION_STEP, ...); a.setFloatValue(23.4f); // blows away my custom precision NumericField b = new NumericField( "price" , ...); b.setPrecisionStep(MY_CUSTOM_PRECISION_STEP); b.setFloatValue(23.4f); // blows away my custom precision NumericField c = new NumericField( "price" , ...); c.setFloatValue(23.4f); c.setPrecisionStep(MY_CUSTOM_PRECISION_STEP); // only way to get my value used ...that seems sketchy, and really anoying if people try reusing NumericField instances. If the goal is to have good "defaults" based on type then why not just have a constant per type that people can refer to explicitly? if they don't know what number to pick ... as well as a true "default" if they pick nothing. int DEFAULT_STEP = ...; int SUGGESTED_INT_STEP = ...; int SUGGESTED_FLOAT_STEP = ...;
        Hide
        Uwe Schindler added a comment -

        In my opinion, I would like to keep the precisionStep parameter required and give the 4 constants for each data type in NumericUtils.

        On the other hand 4 is a (maybe) good default, so I propose, that all ctors/factories getting a precisionStep default it to 4, if left out. precisionStep is a final variable in NumericTokenStream (and so in NumericField), because it does not make sense to change it. If "field" is final, also precisionStep should be final (one field must always use the same precision step). In principle Mike is right, the type is also fixed after first calling setXxxValue, so I could throw an IAE, if somebody calles a setter for a different datatype after the first one. A IllegalStateEx is thrown, when the value was not initialized and the docinverter tries to use the token stream.

        Here are two ideas to fix this the defaultPrecStep per type:

        1. Special value 0 as default precStep:

        • the no-precStep ctor sets the precStep in NumTokenStream to 0 (invalid value), if one is given it must be >0 and <=valsize
        • when delivering the tokens, NumTokenStream uses the default for this data type if precStep==0 and the given value in all other cases
          In this case the precStep is still final in NumericTokenStream, with 0 means "automatic".

        2. There is one other idea:
        NumericField/-TokenStream could have a required ctor param type that can be NumericField.Type.INT,... In this case the default could be choosen very simple at the beginning. And it also fixes the data type. If somebody calls setDoubleValue but has initialized the TokenStream with NumericField.Type.INT, he will get an UOE.

        The javadocs should always clearly note, that one should check out a good precStep.


        By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting). RangeQueries would still work, but are as slow as conventional ones (current solr trunk contains this hint in its TrieField docs/schema)

        Show
        Uwe Schindler added a comment - In my opinion, I would like to keep the precisionStep parameter required and give the 4 constants for each data type in NumericUtils. On the other hand 4 is a (maybe) good default, so I propose, that all ctors/factories getting a precisionStep default it to 4, if left out. precisionStep is a final variable in NumericTokenStream (and so in NumericField), because it does not make sense to change it. If "field" is final, also precisionStep should be final (one field must always use the same precision step). In principle Mike is right, the type is also fixed after first calling setXxxValue, so I could throw an IAE, if somebody calles a setter for a different datatype after the first one. A IllegalStateEx is thrown, when the value was not initialized and the docinverter tries to use the token stream. Here are two ideas to fix this the defaultPrecStep per type: 1. Special value 0 as default precStep: the no-precStep ctor sets the precStep in NumTokenStream to 0 (invalid value), if one is given it must be >0 and <=valsize when delivering the tokens, NumTokenStream uses the default for this data type if precStep==0 and the given value in all other cases In this case the precStep is still final in NumericTokenStream, with 0 means "automatic". 2. There is one other idea: NumericField/-TokenStream could have a required ctor param type that can be NumericField.Type.INT,... In this case the default could be choosen very simple at the beginning. And it also fixes the data type. If somebody calls setDoubleValue but has initialized the TokenStream with NumericField.Type.INT, he will get an UOE. The javadocs should always clearly note, that one should check out a good precStep. — By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting). RangeQueries would still work, but are as slow as conventional ones (current solr trunk contains this hint in its TrieField docs/schema)
        Hide
        Earwin Burrfoot added a comment -

        Having half of your methods constantly fail with an exception depending on constructor parameter. That just screams - "Split me into two classes!"

        Show
        Earwin Burrfoot added a comment - Having half of your methods constantly fail with an exception depending on constructor parameter. That just screams - "Split me into two classes!"
        Hide
        Uwe Schindler added a comment -

        Four classes! And with LUCENE-1710 there will be six! Not a good idea. 6 classes for NumericTokenStream, 6 for NumericField and maybe 6 for NumRangeQuery/Filter. brrrrr

        Show
        Uwe Schindler added a comment - Four classes! And with LUCENE-1710 there will be six! Not a good idea. 6 classes for NumericTokenStream, 6 for NumericField and maybe 6 for NumRangeQuery/Filter. brrrrr
        Hide
        Uwe Schindler added a comment -

        ...and unmaintainable. I merged the two classes from contrib because of this. It was just duplicate code with some small variances. Always a problem for copy/paste operations.

        Show
        Uwe Schindler added a comment - ...and unmaintainable. I merged the two classes from contrib because of this. It was just duplicate code with some small variances. Always a problem for copy/paste operations.
        Hide
        Michael McCandless added a comment -

        I propose, that all ctors/factories getting a precisionStep default it to 4

        +1, with javadocs encouraging experimentation.

        I think the ideas to conditionalize the default according to type add
        spooky complexity

        By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting).

        Let's be sure to call out this use-case in the javadocs!

        Show
        Michael McCandless added a comment - I propose, that all ctors/factories getting a precisionStep default it to 4 +1, with javadocs encouraging experimentation. I think the ideas to conditionalize the default according to type add spooky complexity By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting). Let's be sure to call out this use-case in the javadocs!
        Hide
        Hoss Man added a comment -

        By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting).

        This screams out for additional (redundant) constants that are self documenting in their names...

        int PRECISION_STEP_DEFAULT = 4; // i think?
        int PRECISION_STEP_SUGGESTED_INT_RANGEANDSORT = ...; // no idea what this should be
        int PRECISION_STEP_SUGGESTED_INT_SORTONLY = 32; // i think?
        int PRECISION_STEP_SUGGESTED_FLOAT_RANGEANDSORT = ...; // no idea what this should be
        int PRECISION_STEP_SUGGESTED_FLOAT_SORTONLY = 32; // i think?
        ...
        
        Show
        Hoss Man added a comment - By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting). This screams out for additional (redundant) constants that are self documenting in their names... int PRECISION_STEP_DEFAULT = 4; // i think? int PRECISION_STEP_SUGGESTED_INT_RANGEANDSORT = ...; // no idea what this should be int PRECISION_STEP_SUGGESTED_INT_SORTONLY = 32; // i think? int PRECISION_STEP_SUGGESTED_FLOAT_RANGEANDSORT = ...; // no idea what this should be int PRECISION_STEP_SUGGESTED_FLOAT_SORTONLY = 32; // i think? ...
        Hide
        Michael McCandless added a comment -

        How about we add PRECISION_STEP_DEFAULT=4, make that the default for all types, and then note in the javadocs the "interesting" values for precision step (ie for sorting only)?

        Show
        Michael McCandless added a comment - How about we add PRECISION_STEP_DEFAULT=4, make that the default for all types, and then note in the javadocs the "interesting" values for precision step (ie for sorting only)?
        Hide
        Michael McCandless added a comment -

        I'm assuming this one is yours Uwe!

        Show
        Michael McCandless added a comment - I'm assuming this one is yours Uwe!
        Hide
        Uwe Schindler added a comment -

        Attached is a patch with the default precisionStep of 4. The javadocs of NumericRangeQuery list all possible and senseful values.

        This patch also contains some cleanup in NumericUtils (rename constants) and a lot of other JavaDocs fixes. It also changes the token types of the TokenStream (no difference between 32/64 bit vals needed) and adds a test for them.

        Show
        Uwe Schindler added a comment - Attached is a patch with the default precisionStep of 4. The javadocs of NumericRangeQuery list all possible and senseful values. This patch also contains some cleanup in NumericUtils (rename constants) and a lot of other JavaDocs fixes. It also changes the token types of the TokenStream (no difference between 32/64 bit vals needed) and adds a test for them.
        Hide
        Michael McCandless added a comment -

        Patch looks good Uwe!

        Show
        Michael McCandless added a comment - Patch looks good Uwe!
        Hide
        Uwe Schindler added a comment -

        Some updates:

        • remove the limitation of the precisionStep to the value size (32 or 64). To index a non-trie-encoded numeric field, you can simply use Integer.MAX_VALUE or any other value>=64 as precStep. The lower limit of precStep is 1.
        • add a test for 64 bit values with recommended precStep=6
        • add simple test for unlimited precStep as above
          I will commit this shortly.
        Show
        Uwe Schindler added a comment - Some updates: remove the limitation of the precisionStep to the value size (32 or 64). To index a non-trie-encoded numeric field, you can simply use Integer.MAX_VALUE or any other value>=64 as precStep. The lower limit of precStep is 1. add a test for 64 bit values with recommended precStep=6 add simple test for unlimited precStep as above I will commit this shortly.
        Hide
        Uwe Schindler added a comment -

        Committed revision: 793823

        Thanks Mike!

        Show
        Uwe Schindler added a comment - Committed revision: 793823 Thanks Mike!

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development