Lucene - Core
LUCENE-1701

Add NumericField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: core/index, core/search
    • Labels: None
    • Lucene Fields: New

      Description

      In discussions about LUCENE-1673, Mike and I wanted to add a new NumericField to o.a.l.document specifically for easy indexing. An alternative would be to add a NumericUtils.newXxxField() factory that creates a preconfigured Field instance with norms and tf off, optionally a stored text value (LUCENE-1699), and the TokenStream already initialized. On the other hand, NumericUtils.newXxxSortField could be moved to NumericSortField.

      Yonik and I tend to prefer the factory for both; Mike tends to prefer the new classes.

      Also, the parsers for string-formatted numerics are not public in FieldCache. As the new SortField API (LUCENE-1478) makes it possible to supply a parser at SortField instantiation, it would be good to have the static parsers in FieldCache publicly available. SortField would initialize its member variable to them (instead of null), making the code a lot simpler (FieldComparator currently has ugly null checks when retrieving values from the cache).

      Moving the trie parsers as static instances into FieldCache would also make the code cleaner, and we would be able to hide the "hack" StopFillCacheException by making it private to FieldCache (currently it is public because NumericUtils is in o.a.l.util).

      1. LUCENE-1701.patch
        111 kB
        Uwe Schindler
      2. LUCENE-1701.patch
        111 kB
        Uwe Schindler
      3. LUCENE-1701.patch
        106 kB
        Uwe Schindler
      4. LUCENE-1701.patch
        101 kB
        Uwe Schindler
      5. LUCENE-1701.patch
        100 kB
        Uwe Schindler
      6. LUCENE-1701.patch
        99 kB
        Uwe Schindler
      7. LUCENE-1701.patch
        98 kB
        Uwe Schindler
      8. LUCENE-1701-test-tag-special.patch
        2 kB
        Uwe Schindler
      9. NumericField.java
        4 kB
        Uwe Schindler

        Issue Links

          Activity

          Earwin Burrfoot added a comment -

          I vote for factories - escaping back-compat woes by exposing minimum interface.

          Michael McCandless added a comment -

          Uwe can you also open an issue for handling byte/short/Date with
          Numeric*?

          I vote for factories - escaping back-compat woes by exposing minimum interface.

          By this same logic, should we remove NumericRangeFilter/Query and use
          static factories instead?

          We can't let fear of our back-compat policies prevent progress.

          I seem to be the only one [who's speaking up, at least] who feels
          consumability of Lucene's APIs is important...

          Here's my reasoning: numeric fields are common; many apps need them.
          But, it's painful to use them today; it's a trap for users because
          Lucene acts like it can handle them (e.g. SortField.INT exists) but then
          RangeQuery is buggy unless you encode the numbers (zero-pad ints, use
          Solr's or your own NumberUtils for floats/doubles). And once you've
          figured out the encoding, you discover RangeQuery can have horrific
          performance.
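
          The zero-padding point above can be illustrated with a small self-contained
          sketch (illustrative names, not Lucene code): term ranges compare strings
          lexicographically, so unpadded integers sort in the wrong order, while
          fixed-width zero-padded ones sort like numbers.

```java
// Toy illustration of why plain-text numeric terms break lexicographic ranges.
// SketchPadding and its methods are hypothetical names, not part of Lucene.
final class SketchPadding {
    // Term ranges effectively use lexicographic string comparison.
    boolean lexLessOrEqual(String a, String b) {
        return a.compareTo(b) <= 0;
    }

    // Fixed-width zero padding restores numeric ordering for non-negative ints.
    String zeroPad(int value, int width) {
        return String.format("%0" + width + "d", value);
    }
}
```

          With this model, "2" compares greater than "10", while "0000000002"
          correctly compares less than "0000000010".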

          For the longest time Lucene could not provide good ootb handling of
          numerics, but now finally an awesome step forward (thank you Uwe!)
          comes along... and Lucene can provide correct & performant handling of
          numerics.

          Such an important & useful functionality deserves a consumable API.
          It should be obvious to people playing with Lucene how to use numeric
          fields. I should be able to do this:

          Document doc = new Document();
          doc.add(new NumericField("price", 15.50f));
          

          not this:

          Document doc = new Document();
          Field f = new Field("price", new NumericTokenStream(4).setFloatValue(15.50f));
          f.setOmitNorms(true);
          f.setOmitTermFreqAndPositions(true);
          doc.add(f);
          

          nor, this:

          Document doc = new Document();
          doc.add(NumericUtils.createFloatField("price", 15.50f));
          

          When I want to reuse, I should be able to call
          NumericField.setFloatValue(), not ask the TokenStream to set the
          value.

          In fact, as a user of this API, I shouldn't even have to know that a
          powerful TokenStream was created to index my NumericField. I
          shouldn't have to know to set those advanced flags on Field. These
          are implementation details. In fact, with time we may make
          improvements to these "implementation details", so we don't want such
          implementation details out in the user's code.

          NumericUtils should be utility methods used only by the current
          implementation. Ideally it would not even be public, but Java doesn't
          give us the ability to be package-private to "org.apache.lucene.*".

          Here's what I propose:

          • Add NumericField and NumericSortField, and
            rename RangeQuery -> TermRangeQuery (TextRangeQuery?).
          • Move the Numeric FieldCache parsers into FieldCache,
            and make them (PLAIN_TEXT_INT_PARSER vs NUMERIC_INT_PARSER) public.
          • I would also really like to have NumericField come back when you
            retrieve the doc; this only requires 1 bit added to the flags
            stored in each doc's entry in the .fdt file.

          Why should we make such an excellent addition to Lucene, only to make
          it hard to use?

          Uwe Schindler added a comment - edited

          But the same problem as with NumericTokenStream also affects NumericField: because of type safety it will only work with a setXxxValue() method (if not a factory), e.g.

          doc.add(new NumericField("price", precisionStep).setFloatValue(15.50f));
          

          This code is not shorter than:

          doc.add(NumericUtils.newFloatField("price", precisionStep, 15.50f));
          

          Additionally with LUCENE-1699, we could also add Field.Store.XXX to the factory/ctor.

          OK, the factory solution has the problem that you cannot reuse the field instance for efficiency, so this is an argument for the extra class that has setXxxValue().

          For SortField: the factory code inside NumericUtils is only one line; it just creates a conventional SortField with a specific parser. If we do not want the factory in NumericUtils, I could also use the additional ctor of the normal SortField that takes a parser (which is still there, LUCENE-1478). When all parsers are centralized in FieldCache, one can create a SortField with one line of code (the current factory demonstrates this).
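
          The reuse argument can be modeled with a tiny self-contained sketch
          (class and method names here are illustrative, not the actual Lucene API):
          a dedicated field class with a fluent setter lets the same instance be
          refilled with a new value per document, which a static factory that
          returns a fresh object every call cannot offer.

```java
// Hypothetical model of a reusable numeric field with a fluent setter.
// Not Lucene's NumericField; names are for illustration only.
final class SketchNumericField {
    private final String name;
    private final int precisionStep;
    private float floatValue;

    SketchNumericField(String name, int precisionStep) {
        this.name = name;
        this.precisionStep = precisionStep;
    }

    // Fluent setter: enables one-liners like
    // doc.add(new SketchNumericField("price", 4).setFloatValue(15.50f))
    // and lets the same instance be reused across documents.
    SketchNumericField setFloatValue(float value) {
        this.floatValue = value;
        return this;
    }

    String name() { return name; }
    float floatValue() { return floatValue; }
}
```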

          Earwin Burrfoot added a comment - edited

          Mike, I very much agree with everything you said, except "factory is less consumable than constructor" and "add stuff to index to handle NumericField".

          Out of your three examples the second one is bad, no question. But the first and last are absolutely equal in terms of consumability.
          Static factories are cool (they allow switching implementations and instantiation logic without changing the API) and are as easy to use as constructors (probably even easier with generics in Java 5).

          If we add some generic storable flags for Lucene fields, that is cool (probably); NumericField can then capitalize on it, as can users writing their own NNNFields.
          Tying the index format to one particular implementation of numerics is bad design. Why on earth can't my own split-field (vs. single-field as in current Lucene) trie-encoded number enjoy the same benefits as NumericField from Lucene core?

          By this same logic, should we remove NumericRangeFilter/Query and use static factories instead?

          I do use factory methods for all my queries and filters, and it makes me feel warm and fuzzy! Under the hood some of them consult FieldInfo to instantiate custom-tailored query variants, so I just use range(CREATION_TIME, from, to) and don't think about whether this field is trie-encoded or raw.

          "Simple things should be simple", okay. Complex things should be simple too, argh!

          Uwe Schindler added a comment -

          Here is a first draft of NumericField with the same handling as NumericTokenStream. It is for indexing only; when retrieving stored fields, one would get the numeric field value as a string (according to Number.toString()). Because of this, the class returns the string representation of the numeric value from stringValue(), and tokenStreamValue() returns the NumericTokenStream.

          For SortField I still hold the very strong opinion that an extra class is not needed here. A factory is enough (and even too much; supplying the parser to SortField would suffice).

          I will post a patch later with this file and the moved, now-public parsers.
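
          A minimal self-contained model of the stringValue() behavior described
          above (names are illustrative; this is not the attached NumericField.java):
          the stored-field representation is simply the number's Number.toString()
          form.

```java
// Hypothetical sketch of the draft's stored-value behavior: stringValue()
// returns the numeric value's plain string form, or null if no value is set.
// (The real draft additionally exposes a NumericTokenStream for indexing.)
final class SketchDraftNumericField {
    private Number value;

    SketchDraftNumericField setFloatValue(float v) {
        this.value = v; // autoboxed to Float
        return this;
    }

    // Stored-field representation, per Number.toString().
    String stringValue() {
        return value == null ? null : value.toString();
    }
}
```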

          Yonik Seeley added a comment -

          Having the trie parsers public is good (or public factory method(s) to get the right parser given a set of trie params), but shouldn't they stay with the trie classes? Or am I misunderstanding where you are proposing to move the parsers?

          Uwe Schindler added a comment -

          Yonik, I will explain my intention:
          The real numeric parsing is always done in NumericUtils, just as for the conventional FieldCache the real parsing is done in Integer.parseInt() and so on.
          Specific to FieldCache is the Parser interface, a Java interface belonging to FieldCache. FieldCache currently has (private) static parser instances for Number.toString()-type fields. These parsers are simple five-liners and singletons. For trie fields this is the same: the static parser instances (also singletons) should likewise be moved to FieldCache, and then both exposed as public constants (as Mike said: PLAIN_TEXT_INT_PARSER and NUMERIC_INT_PARSER).
          Currently, for trie fields there is an ugly hack in FieldCache that stops parsing when a term with lower precision is reached (as trie terms are ordered with highest precision first, the cache for a field is completely filled once the first lower-precision term appears). Because of this, the trie parsers throw an unchecked StopFillCacheException to stop the iteration of TermEnum/TermDocs in the uninverter. This is just a hack, and because of package differences this FieldCache-internal exception had to be made public (see Javadocs). After moving the parser instances to FieldCache, this exception can be hidden again and made private to the FieldCache implementation (until we get the better uninverters some time in the future, see LUCENE-831). NumericUtils then only needs to export a method to get the shift value out of an encoded string (which I forgot to add in LUCENE-1673 when removing ShiftAttribute).
          Trie field parsing does not depend on trie-specific flags; precisionStep is not needed, so the parsers are true singletons.

          SortFields can then simply be created in the following way: new SortField(field, FieldCache.PLAIN_TEXT_INT_PARSER) for a conventional int field (like with SortField.INT), or new SortField(field, FieldCache.NUMERIC_INT_PARSER). Passing a null parser still does the same as before: it implicitly uses FieldCache.PLAIN_TEXT_INT_PARSER.
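
          The stop-fill mechanism described above can be sketched with a toy,
          self-contained model (all names here are hypothetical; this is not
          Lucene's real FieldCache or trie encoding): terms arrive full-precision
          first, and the parser aborts the fill loop with an unchecked exception
          as soon as the first lower-precision term appears.

```java
// Toy model of the StopFillCacheException hack. Here the first character of a
// term is assumed to encode the precision shift ('a' = shift 0 = full
// precision); this encoding is purely illustrative.
final class StopFillCacheException extends RuntimeException {}

final class SketchTrieIntParser {
    int parseInt(String term) {
        int shift = term.charAt(0) - 'a';
        if (shift != 0) throw new StopFillCacheException();
        return Integer.parseInt(term.substring(1));
    }
}

final class SketchUninverter {
    // Fills the cache from terms ordered full-precision first; the parser's
    // exception signals that all full-precision values have been seen.
    int[] fill(String[] orderedTerms, SketchTrieIntParser parser) {
        java.util.List<Integer> values = new java.util.ArrayList<>();
        try {
            for (String term : orderedTerms) {
                values.add(parser.parseInt(term));
            }
        } catch (StopFillCacheException e) {
            // Expected: stop iterating once lower-precision terms begin.
        }
        int[] result = new int[values.size()];
        for (int i = 0; i < result.length; i++) result[i] = values.get(i);
        return result;
    }
}
```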

          Yonik Seeley added a comment -

          Regardless of the fact that the plain int parser is on FieldCache, it still doesn't seem like we should add parsers to FieldCache for every field type. It's also the case that a single static parser won't be able to handle all cases... consider future functionality that uses positions or payloads to fill out the full value. A factory allows returning the correct implementation given the parameters (number of bits to store as position, etc.).

          Uwe Schindler added a comment -

          When this comes (payloads, CSF, ...), we will have the new LUCENE-831 field cache, where we will have a ValueSource-like thing. Until then, a static parser is enough, and the actual parsing still lives in NumericUtils! I want to move the parser only because of this hack with the unchecked exception. Once the new FieldCache is alive, this is all moot.

          Yonik Seeley added a comment -

          The exception certainly is a hack - but any new field cache API should be powerful enough to handle trie through the normal APIs that anyone else would have to go through when implementing their own field type. If Trie needs to be part of the new field cache implementation, that would be a big red flag.

          Uwe Schindler added a comment -

          but any new field cache API should be powerful enough to handle trie through the normal APIs that anyone else would have to go through when implementing their own field type

          This is true; the LUCENE-831 patch contains a TrieValueSource for that (but the patch no longer applies, as contrib/search/trie is no longer available). Because of this, it can be implemented very cleanly.

          For easy usage, I would simply suggest making the parsers for the current plain-text and trie field cache implementations publicly available as singletons; very simple, and it hurts nobody. I will post a patch for the whole issue soon. NumericField will be exercised in the NumericRangeQuery index-creation tests (which are currently very ugly, cf. Mike's comments about the code and reusing Fields/TokenStreams; with NumericField it looks like any other indexing code).

          For Solr (SOLR-940), there is no need to use NumericField (which is just a helper); the code of TrieField can stay as it is, with only some renamings and so on. It just presents a TokenStream to the underlying Solr indexer, as before.

          Michael McCandless added a comment -

          Here is a first draft of NumericField with the same handling as NumericTokenStream.

          Looks good Uwe! Is there an OK default for precisionStep so we don't have
          to make that a required arg? 4?

          Michael McCandless added a comment -

          Static factories are cool (they allow to switch implementations and instantiation logic without changing API) and are as easy to use (probably even easier with generics in Java5) as constructors.

          Coolness is in the eye of the beholder?

          Yes, they are cool in that they give the developer (us) future
          freedom (to change the actual class returned, re-use instances, use
          singletons, etc.), but not cool (in my eyes) for consumability.

          Static factory classes are a good fit when the impls really should
          remain anonymous because the differences between them are trivial.
          E.g. the 12 different impls that can be returned by
          TopFieldCollector.create are a good example.

          But NumericField vs. Field, and SortField vs. NumericSortField, are
          different and should be seen as different by consumers of Lucene's API.

          If we add some generic storable flags for Lucene fields, this is cool (probably), NumericField can then capitalize on it, as well as users writing their own NNNFields.

          +1 Wanna make a patch?

          Then NumericField would just tap into this extensibility... and,
          somehow, in our future improved search time document() API, have the
          ability to make a NumericField.

          Why on earth can't my own split-field (vs single-field as in current Lucene) trie-encoded number enjoy the same benefits as NumericField from Lucene core?

          Because.... we've decided that this is our core approach to numerics?

          Seriously, I don't see that as unfair. Trie works well. We have
          chosen it as our way (for now, until something better comes along) of
          handling numerics. Just like we've picked a certain format for the
          terms dict and prx file.

          Sure, we should make it easy (add extensibility) so external fields
          could store stuff in the index, but that doesn't mean we should hold
          back on Numeric* consumability until we get that extensibility.

          I do use factory methods for all my queries and filters, and it makes me feel warm and fuzzy! Under the hood some of them consult FieldInfo to instantiate custom-tailored query variants, so I just use range(CREATION_TIME, from, to) and don't think if this field is trie-encoded or raw.

          Someday maybe I'll convince you to donate this "schema" layer on top
          of Lucene. But I hope there are SOME named classes in there and not
          all static factory methods returning anonymous untyped impls.

          "Simple things should be simple", okay. Complex things should be simple too, argh!

          Whoa, this is all simple stuff? What should be complex about using
          numeric fields in Lucene? This whole issue is "simple things should
          be simple".

          Yonik Seeley added a comment -

          Because.... we've decided that this is our core approach to numerics?

          We decided to move trie from contrib to core because it was the most stable and usable numeric implementation. We did not decide to rewrite it or make it "special". It's not sufficient for everyone; there will be (many) enhancements to trie, and there will be other numeric field types. Trie is not, and should not be, the only numeric field, and it should not be baked into the index format.

          The next step after adding NumericField seems to be "it's a bug if getDocument() doesn't return a NumericField, so we must encode it in the index". If that's the case, I'm -1 on adding NumericField in the first place.

          Michael McCandless added a comment -

          I still think we should make NumericSortField strongly typed (not a
          factory method that returns a SortField w/ the right parser).

          I think it's far more consumable, from the user's standpoint. I made
          a NumericField when indexing so making a NumericSortField to sort
          makes it "obvious".

          Ie this:

          new NumericSortField("price");
          

          is better than this:

          new SortField("price", FieldCache.NUMERIC_FLOAT_PARSER);
          

          or this:

          SortField.getNumericSortField("price", SortField.FLOAT);
          

          Uwe, I agree that if we take the developer's (us) standpoint, the
          implementation of NumericSortField is so trivial (just pick the right
          parser) that it's tempting to not name the class. But from the user's
          standpoint it's less consumable.

          Regardless of the fact that plain_int parser is on FieldCache, it still doesn't seem like we should add parsers to FieldCache for every field type.

          {Extended,}FieldCache already is the central place that holds parsers
          for all of Lucene's core types; why change that? Leaving Numeric* out
          is dangerous because then people will naturally assume getFloats() is
          the method to call.

          Yonik Seeley added a comment -

          new NumericSortField("price");

          Magic.
          How is this supposed to work?
          How will the sort field know the exact encoding? How many bits are stored in a payload or in the position?
          Could I make my own field type that had the same privileges?

          Uwe Schindler added a comment -

          I agree with Yonik, this is too much magic and would not work. There must be at least the type (which would be SortField.FLOAT or something like that). If you keep that in mind, there is really no difference between:

          new SortField("price", FieldCache.NUMERIC_FLOAT_PARSER)
          

          and

          new SortField("price", FieldCache.PLAIN_TEXT_FLOAT_PARSER)
          

          equivalent to:

          new SortField("price", SortField.FLOAT)
          
          Michael McCandless added a comment -

          It's not sufficient for everyone, there will be (many) enhancements to trie, and there will be other numeric field types. Trie is not, and should not be the only numeric field, and should not be baked into the index format.

          But trie is the best we have today?

          And it's sooooo much better than we had before? Prior to trie, with
          Lucene directly if you did a RangeQuery on a float/double field, it
          was a nightmare. You had to get Solr's NumberUtils (or, roll your
          own) to get something that even returned the correct results. Then,
          you'll discover that performance was basically unusable.

          The addition of numeric indexing to Lucene is a major step forward.
          Why not make it work well with Lucene, today, and as these future
          improvements arrive, we take them as they come? Design for today.

          Anyway, if push comes to shove (which it seems to be doing!), I can
          accept just returning a Field (not NumericField) from the document at
          search time. I think it's a worse user experience, and will be seen
          as buggy, but it would in fact mean zero change to the index format.

          Michael McCandless added a comment -

          new NumericSortField("price");

          Magic.
          How is this supposed to work?
          How will the sort field know the exact encoding? How many bits are stored in a payload or in the position?
          Could I make my own field type that had the same privileges?

          Woops, sorry, you're right: you'd need to specify the type, at least.

          So.... how about for SortField we make the parser a required arg (as
          Uwe suggested) for numeric types? Allow null to mean "give me the
          default parser" for all other types. And we consolidate all parsers
          in FieldCache.
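
          A minimal sketch of that proposal (parser is a required arg, null falls back to the default for the type). SortFieldSketch and its Parser interface are stand-ins for illustration, not the actual Lucene API:

          ```java
          import java.util.Objects;

          public class SortFieldSketch {
              interface Parser { long parse(String term); }

              // Hypothetical default parser, normally consolidated in FieldCache.
              static final Parser DEFAULT_LONG_PARSER = Long::parseLong;

              private final String field;
              private final Parser parser;

              // A null parser means "give me the default", so callers who indexed
              // plain-text numbers never have to think about parsers at all.
              SortFieldSketch(String field, Parser parser) {
                  this.field = Objects.requireNonNull(field);
                  this.parser = (parser != null) ? parser : DEFAULT_LONG_PARSER;
              }

              Parser getParser() { return parser; }

              public static void main(String[] args) {
                  SortFieldSketch sf = new SortFieldSketch("price", null);
                  System.out.println(sf.getParser().parse("42")); // prints 42
              }
          }
          ```

          With this shape, sorting code can assume a non-null parser and drop the null checks FieldComparator currently needs.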

          Uwe Schindler added a comment -

          That is what I was talking about all the time!

          But this is all not really the best solution. It is too bad that LUCENE-831 (not in its current form, it is just not discussed to the end) will not be included in 2.9. I know we will have to stick with parsers until the end of days, because the new ValueSource stuff and Uninverters were not introduced early enough to be removed with 3.0. And Trie fields with positions/payloads would never work with the current FieldCache, so I need no factories.

          So this would be postponed until this is done. Then we could have a ValueSource for trie fields, where you could add this magic stuff like payloads and so on. But until this comes to reality (including CSF), the static parsers are all we have for now and are best placed in FieldCache (because of the strong linkage with this ugly exception to be hidden from the outside).

          Earwin, Yonik: I know TrieRange is only one implementation of the whole numeric problem, but none of you ever presented your implementation to the public. This is the best we have now, it's included in core and everybody is happy. If you have a better private implementation, you can still use it!

          Michael McCandless added a comment -

          That is what I was talking about all the time!

          Eh, yeah... sometimes things just need a good hashing out!

          2.9 is really shaping up to be an awesome release...

          But this is all not really the best solution. It is too bad, that LUCENE-831 (not in current form, it is just not discussed to the end) is not to be included into 2.9.

          Progress not perfection!

          I know we will have to stick with parsers until end of days, because the new ValueSource staff and Uninverters was not introduces early enough to be removed with 3.0.

          Well we are discussing relaxing the back-compat policy... so maybe
          earlier than "end of days".

          And Trie fields with positions/payloads would never work with current FieldCache, so I need no factories.

          We'll cross that bridge when we get there...

          So this would be postponed until this is done. Then we could have a ValueSource for trie fields, where you could add these magic stuff like payloads and so on. But until this comes to reality (including CSF), the static parsers is all we have until now and is best placed in FieldCache (because of the strong linkage with this ugly exception to be hidden to the outside).

          I agree.

          Earwin, Yonik: I know TrieRange is only one implementation of the whole numeric problem, but none of you ever presented your implementation to the public. This is the best we have now, its included into core and everybody is happy. If you have a better private implementation, you can still use it!

          We shouldn't weaken trie's integration to core just because others
          have private implementations. What's important is that we don't
          weaken those private implementations with trie's addition, and I don't
          think our approach here has done that.

          Earwin Burrfoot added a comment -

          Someday maybe I'll convince you to donate this "schema" layer on top of Lucene

          It's not generic enough to be of use for every user of Lucene, and it doesn't aim to be such. It also evolves, and donating something to Lucene means casting it in concrete.
          So that's not me being greedy or lazy (okay, maybe a little bit of the latter), it's simply not public-quality (as I understand it) code.
          I can share the design if anybody's interested, but everyone's coping with it themselves it seems.

          Solr has its own schema approach, and it has its merits and downfalls compared to mine. That's what is nice, we're able to use the same library in differing ways, and it doesn't force its sense of 'best practices' on us.

          But I hope there are SOME named classes in there and not all static factory methods returning anonymous untyped impls.

          SOME of them aren't static :-D

          We shouldn't weaken trie's integration to core just because others have private implementations.

          You shouldn't integrate into core something that is not core functionality. Think microkernels.
          It's strange seeing you drive CSFs, custom indexing chains, pluggability everywhere on one side, and trying to add some weird custom properties into index that are tightly interwoven with only one of possible numeric implementations on the other side.

          Design for today.

          And spend two years deprecating and supporting today's designs after you get a better thing tomorrow. Back-compat Lucene-style and agile design aren't something that marries well.

          What's important is that we don't weaken those private implementations with trie's addition, and I don't think our approach here has done that.

          You're weakening Lucene itself by introducing too much coupling between its components.

          IndexReader/Writer pair is a good example of what I'm arguing against. A dusty closet of microfeatures that are tightly interwoven into a complex hard-to-maintain mess with zillions of (possibly broken) control paths - remember mutable deletes/norms+clone/reopen permutations? It could be avoided if IR/W were kept to the bare minimum (which most people are going to use), and more advanced features were built on top of it, not in the same place.

          NRT seems to tread the same path, and I'm not sure it's going to win that much turnaround time after newly-introduced per-segment collection. Some time ago I finished a first version of IR plugins, and enjoy pretty low reopen times (field/facet/filter cache warmups included). (Yes, I'm going to open an issue for plugins once they stabilize enough)

          > If we add some generic storable flags for Lucene fields, this is cool (probably), NumericField can then capitalize on it, as well as users writing their own NNNFields.
          +1 Wanna make a patch?

          No, I'd like to continue IR cleanup and play with positionIncrement companion value that could enable true multiword synonyms.
          I know, I know, it's do-a-cracy. But it's not an excuse for hacks.

          Michael McCandless added a comment -

          Someday maybe I'll convince you to donate this "schema" layer on top of Lucene

          It's not generic enough to be of use for every user of Lucene, and it doesn't aim to be such.

          Oh, OK.

          Solr has its own schema approach, and it has its merits and downfalls compared to mine. That's what is nice, we're able to use the same library in differing ways, and it doesn't force its sense of 'best practices' on us.

          There's no forcing going on, here. Even had we added the bit into the
          index, there's still no "forcing". We're not preventing advanced uses
          of Lucene by providing strong Numeric* support in Lucene. Simple
          things should be simple; complex things should be possible...

          But I hope there are SOME named classes in there and not all static factory methods returning anonymous untyped impls.

          SOME of them aren't static :-D

          Heh.

          We shouldn't weaken trie's integration to core just because others have private implementations.

          You shouldn't integrate into core something that is not core functionality. Think microkernels.
          It's strange seeing you drive CSFs, custom indexing chains, pluggability everywhere on one side, and trying to add some weird custom properties into index that are tightly interwoven with only one of possible numeric implementations on the other side.

          I agree: if Lucene had all extension points that'd make it possible
          for a good integration of Numeric* without being in "core", we should
          use that. But we're just not there yet. We want to get there, and we
          will, but we can't hold up progress just because we think someday
          we'll get there. That's like saying we can't improve the terms dict
          format because it's not pluggable yet.

          Design for today.

          And spend two years deprecating and supporting today's designs after you get a better thing tomorrow. Back-compat Lucene-style and agile design aren't something that marries well.

          donating something to Lucene means casting it in concrete.

          We can't let fear of back-compat prevent us from making progress.

          IndexReader/Writer pair is a good example of what I'm arguing against. A dusty closet of microfeatures that are tightly interwoven into a complex hard-to-maintain mess with zillions of (possibly broken) control paths - remember mutable deletes/norms+clone/reopen permutations? It could be avoided if IR/W were kept to the bare minimum (which most people are going to use), and more advanced features were built on top of it, not in the same place.

          Sure, our approach today isn't perfect ("progress not perfection").
          There are always improvements to be done. If you see concrete steps
          to simplify the current approach without losing functionality, please
          post a patch. I too would love to see such simplifications...

          NRT seems to tread the same path, and I'm not sure it's going to win that much turnaround time after newly-introduced per-segment collection.

          I agree, per-segment collection was the bulk of the gains needed for
          NRT. This was a big change and a huge step forward in simple reopen
          turnaround.

          But, not having to write & read deletes to disk, not commit (fsync)
          from writer in order to see those changes in reader should also give
          us decent gains. fsync is surprisingly and intermittently costly.

          And this integration lets us take it a step further with LUCENE-1313,
          where recently created segments can remain in RAM and be shared with
          the reader.

          If you have good simplifications/improvements on the approach here,
          please post them.

          Some time ago I finished a first version of IR plugins, and enjoy pretty low reopen times (field/facet/filter cache warmups included). (Yes, I'm going to open an issue for plugins once they stabilize enough)

          I'm confused: I thought that effort was to make SegmentReader's
          components fully pluggable? (Not to actually change what components
          SegmentReader is creating). EG does this modularization alter the
          approach to NRT? I thought they were orthogonal.

          If we add some generic storable flags for Lucene fields, this is cool (probably), NumericField can then capitalize on it, as well as users writing their own NNNFields.

          +1 Wanna make a patch?

          No, I'd like to continue IR cleanup and play with positionIncrement companion value that could enable true multiword synonyms.

          Well I'm looking forward to seeing your approach on these two!

          Uwe Schindler added a comment -

          Patch with all changes, including LUCENE-1687 (it is easier to do this together):

          • Add NumericField, change JavaDocs to prefer this where possible.
          • Change range tests to use NumericField during build of test index, this also tests stored fields with NumericField (much cleaner now)
          • Remove SortField factory from NumericUtils, this class is now almost only for expert users; creating a SortField is possible with SortField ctor using parser instance.
          • Make all parsers in FieldCache public (DEFAULT_XXX_PARSER)
          • Add trie parsers to FieldCache, too (NUMERIC_UTILS_XXX_PARSER)
          • Hide StopFillCacheException-hack
          • Change SortField to automatically initialize the correct parser according to the type (defaults to text-only parsers) – some good javadocs are still missing to tell the user that it is better to use SortField(String, Parser) instead of SortField(String, type-int) for numeric values, especially when indexed using NumericField.
          • Because SortField is serializable, all parsers were made singletons and serializable, too (superinterface Parser extends Serializable, default parsers define readResolve() to enforce singletons, which are important for FieldCache to work correctly)
          • Remove now unneeded (parser==null) checks in sorting code, as SortField enforces a non-null parser now.
          • Remove all code from ExtendedFieldCache and move to FieldCache (see LUCENE-1687), keep a stub for binary backwards-compatibility. The only implementation is now FieldCacheImpl referred to by DEFAULT (and EXT_DEFAULT for bw).

          A short note: SortField is only serializable if all custom comparators used are also serializable; maybe we should note this in the docs too. Parsers are automatically serializable (because of the superinterface), but not automatically real singletons (but this is not Lucene's problem).
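
          The readResolve() trick mentioned in the patch notes can be sketched like this. Without it, deserializing a SortField would yield a new parser instance, and a cache keyed on parser identity would miss. The class and parser names here are illustrative, not the actual Lucene ones:

          ```java
          import java.io.*;

          public class SingletonParserDemo {
              // Superinterface extends Serializable, as described in the patch notes.
              interface Parser extends Serializable { int parseInt(String term); }

              static final Parser DEFAULT_INT_PARSER = new Parser() {
                  public int parseInt(String term) { return Integer.parseInt(term); }
                  // Called by ObjectInputStream after deserialization; returning the
                  // shared instance enforces the singleton property.
                  private Object readResolve() { return DEFAULT_INT_PARSER; }
              };

              // Serialize and deserialize an object through an in-memory buffer.
              static Object roundTrip(Object o) throws Exception {
                  ByteArrayOutputStream bos = new ByteArrayOutputStream();
                  try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                      oos.writeObject(o);
                  }
                  try (ObjectInputStream ois = new ObjectInputStream(
                          new ByteArrayInputStream(bos.toByteArray()))) {
                      return ois.readObject();
                  }
              }

              public static void main(String[] args) throws Exception {
                  Object copy = roundTrip(DEFAULT_INT_PARSER);
                  System.out.println(copy == DEFAULT_INT_PARSER); // prints true
              }
          }
          ```

          Because identity survives the round trip, two deserialized SortFields comparing their parsers with == (or hashing them as cache keys) still agree.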

          Uwe Schindler added a comment -

          I know you will kill me, Yonik, and Mike will love me but there is a possibility to also support Trie fields with standard SortField.XXX constants using autodetection. Trie fields always start with a shift-prefix defining the type and so for sure contain non-digits. So FieldCache could simply test and catch NumberFormatException.

          So maybe this would be an option: make the default (parser == null in FieldCache's getInts(), ...) detect this automatically.

          Users could then use SortField/FieldCache as before, ignoring the real encoding. If I implemented this, I could remove the enforcement of a non-null parser in SortField again and make FieldCache do the detection in this case.
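The autodetection idea can be sketched in plain Java (not the actual FieldCache code): try the plain-text parser first, and since trie terms always start with a non-digit shift prefix, they always fall through to the trie parser via NumberFormatException. The trie stand-in here uses an invented one-char-prefix-plus-hex encoding purely for illustration.

```java
public class Main {
  // Stand-in for FieldCache.DEFAULT_INT_PARSER: plain-text numbers.
  static int parseText(String term) { return Integer.parseInt(term); }

  // Stand-in for the trie parser: assume a one-char shift prefix
  // followed by a hex payload (illustrative encoding only).
  static int parseTrie(String term) {
    return Integer.parseInt(term.substring(1), 16);
  }

  static int autoParse(String term) {
    try {
      return parseText(term);  // plain-text numbers succeed here
    } catch (NumberFormatException e) {
      return parseTrie(term);  // the prefix is never a digit, so trie
    }                          // terms always land in this branch
  }

  public static void main(String[] args) {
    System.out.println(autoParse("42"));   // plain text
    System.out.println(autoParse("x2A"));  // "trie" term, 0x2A == 42
  }
}
```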

          Uwe Schindler added a comment -

          The last patch was still not 100% backwards compatible; now it is. The modified backwards-compatibility test TestExtendedFieldCache shows it; it will be committed to the backwards-compatibility branch.

          Yonik Seeley added a comment -

          I'll quote myself, and then attempt to not repeat myself further after this point (the back and forth is silly).

          The next step after adding NumericField seems to be "it's a bug if getDocument() doesn't return a NumericField, so we must encode it in the index". If that's the case, I'm -1 on adding NumericField in the first place.

          Everyone thinks good APIs, good architecture, and good performance are important. To imply otherwise is also silly.

          Michael McCandless added a comment -

          there is a possibility to also support Trie fields with standard SortField.XXX constants using autodetection.

          Meaning we wouldn't be forced to specify FieldCache.DEFAULT_INT_PARSER/FieldCache.NUMERIC_UTILS_INT_PARSER when creating SortField or calling FieldCache.getInts? And we'd make the core parsers package private again?

          So users could simply do SortField("price", SortField.FLOAT) and it'd just work?

          I think this is compelling! Why not take this approach? There would then be no user-visible changes to how you sort by numeric fields...

          Uwe Schindler added a comment - edited

          Yes, it would work this way. It would only go against Yonik's complaints about not mixing Trie too much into the other code, but that is already done because of the StopFillCacheException usage. When we do LUCENE-831, this should be thought about, too.

          SortField.AUTO is deprecated and will not be changed (only detect text numbers). There should be a note, that it would not work with the "new" NumericFields.

          I would make the core parsers public to enable users to have full control (on the other hand, I could now also hide the trie parsers). But hiding them would be a bad approach: wherever automatisms are involved, one should always have the possibility to pin things to one parser. And why do we have the SortField/FieldCache accessors with a parser parameter when you cannot even use the default ones?

          P.S.: About payloads & positions and the need for extra parameters to the parser: after adding support for positions or payloads to encode the highest precision, there is still no need for an extra SortField/Parser class or factory. The future "ValueSource" starts to decode the values until a change in shift occurs. This first shift is for sure the highest precision (because of term ordering); if it is 0, it works like now (no payloads/positions); if the first visible shift > 0, payloads/positions were used and the number of bits there is also known.

          Uwe Schindler added a comment -

          I'll quote myself, and then attempt to not repeat myself further after this point (the back and forth is silly).

          The next step after adding NumericField seems to be "it's a bug if getDocument() doesn't return a NumericField, so we must encode it in the index". If that's the case, I'm -1 on adding NumericField in the first place.

          It is mentioned in the docs that this class is for indexing only:

          * <p><b>Please note:</b> This class is only used during indexing. You can also create
          * numeric stored fields with it, but when retrieving the stored field value
          * from a {@link Document} instance after search, you will get a conventional
          * {@link Fieldable} instance where the numeric values are returned as {@link String}s
          * (according to <code>toString(value)</code> of the used data type).
          

          In my opinion, storing this info in the segments is not doable without pitfalls: if somebody indexes a normal field in one IndexWriter session and starts to index using NumericField in the next session, he would have two segments with different encodings and two different "flags". When these two segments are merged later, what to do with the flag?

          If we want to have such Schemas, they must be index wide and we have no possibility in Lucene for that at the moment.

          If somebody creates a schema that can do this (by storing the schema in a separate file next to the segments file), we can think about it again (with all the problems, like: a MultiReader on top of two indexes with different schemas; forbid that because the schemas differ?). All this tells me we should not do this; it is the task of Solr, my own project panFMP, or Earwin's own schema to enforce it.

          Uwe Schindler added a comment -

          Here is a patch with the auto-detection. All tests pass, also the backwards tests. The parsers are still public for total control over the conversion.

          Michael McCandless added a comment -

          In my opinion: Storing this info in the segments is not doable without pitfalls:

          The proposal was not to store it in the segments file (which I agree has serious problems, since it's global). I had considered FieldInfos (which is "roughly" Lucene's "schema", per segment), but that too has clear problems.

          My proposal was the flags per-field stored in the fdt file. In that file, we are already writing one byte's worth of flags (only 3 of the bits are used now) for every stored field instance. This is in FieldsWriter.java, around line 181. The flags currently record whether each specific field instance was tokenized, compressed, or binary. FieldsReader then uses these flags to reconstruct the Field instances when building the document. These bits are never merged; they are copied (because they apply to that one field instance, in that one document).

          My proposal was to add another flag bit (numeric) and make use of that to return a NumericField instance when you get your document back. It would have no impact to the index size, since we still have 5 free bits to use.

          But, it is technically a (one bit) change to the index format, which people seriously objected to. So net/net I'm OK going forward without it.
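The flag-byte idea above can be sketched with plain bit arithmetic. Bit positions here are illustrative stand-ins, not the actual constants in FieldsWriter/FieldsReader, and the "numeric" bit is the hypothetical addition being proposed.

```java
public class Main {
  // Illustrative bit positions for the one-byte stored-field flags:
  static final byte FIELD_IS_TOKENIZED  = 1;      // bit 0
  static final byte FIELD_IS_BINARY     = 1 << 1; // bit 1
  static final byte FIELD_IS_COMPRESSED = 1 << 2; // bit 2
  // The proposed new flag, using one of the 5 unused bits (hypothetical):
  static final byte FIELD_IS_NUMERIC    = 1 << 3; // bit 3

  public static void main(String[] args) {
    // FieldsWriter-style encoding for one stored field instance:
    byte flags = (byte) (FIELD_IS_TOKENIZED | FIELD_IS_NUMERIC);
    // FieldsReader-style decoding when rebuilding the document:
    System.out.println((flags & FIELD_IS_TOKENIZED) != 0);
    System.out.println((flags & FIELD_IS_BINARY) != 0);
    System.out.println((flags & FIELD_IS_NUMERIC) != 0);
  }
}
```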

          Uwe Schindler added a comment -

          Here is an updated patch:

          • fix some copy/paste duplicates in assignments
          • add extra check to fail early when autodetection fails

          The patch (and the one before) tests the autodetection in two ways (so you can see in the trie tests how one would use SortField with trie). What do you think about all this?

          But, it is technically a (one bit) change to the index format, which people seriously objected to. So net/net I'm OK going forward without it.

          I agree, and for the reasons noted before I would like to see NumericField as only a helper for indexing. Storing the number as a plain-text string (as done in the patch) does not justify returning a NumericField when getting stored fields.

          Once Yonik has committed LUCENE-1699, I will do some additional NumericField fine-tuning, but the patch is finished now.

          Michael McCandless added a comment -

          Uwe, with your patch, it looks like if I ask for e.g. doubles with parser=null, and then ask again with parser=FieldCache.DEFAULT_DOUBLE_PARSER, the entries get stored twice? Is there some way to take the auto-detected parser and use it (not null) in the cache key?

          Uwe Schindler added a comment -

          Uwe, with your patch, it looks like if I ask for e.g. doubles with parser=null, and then ask again with parser=FieldCache.DEFAULT_DOUBLE_PARSER, the entries get stored twice? Is there some way to take the auto-detected parser and use it (not null) in the cache key?

          This is correct; the cache key is different for null vs. an explicit parser, because the result may be different (which is unlikely), but they are two different things. When you ask for an auto-cache (we should deprecate the AUTO part in FieldCache, too! It is not used anymore with the new sorting code, even when SortField.AUTO is enabled), you also get different cache keys!

          Michael McCandless added a comment -

          But, with the new public exposure of the field cache parsers, this is a newly added trap? You could accidentally consume 2X the RAM. Since auto-detection of the parser simply means a specific parser was chosen, why can't we then cache using that chosen parser? Then you would not risk 2X memory usage. Or, maybe we should leave the parsers private to not have this risk.

          Uwe Schindler added a comment - edited

          Not sure! And it is a trap for the BYTE and SHORT caches, too!

          But for sure, the old, never used auto-cache should be deprecated, too!

          Uwe Schindler added a comment -

          Here is a new patch:

          1. deprecated the AUTO cache parts
          2. the test TestFieldCache now also tests the following algorithm
          3. changed autodetection for trie fields to work like the auto cache:
          • in createValue, it is checked whether parser == null; if so, the method calls FieldCache.getInts() again, specifying each of the two possible parsers (for byte and short there is currently only one).
          • After this, the cache contains the same array reference for both key variants ("finally used parser" and "null"). Somebody coming later and asking for a specific parser gets the already cached array, and the same holds for later consumers asking with a null parser.
          • To optimize the array allocation, it is delayed until the first value is successfully decoded. In case of an error, the array stays null and GC is happy.
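The key-sharing behavior described above can be sketched in plain Java (an illustrative model, not the actual FieldCacheImpl): a null-parser lookup delegates to a concrete parser and then caches the same array under both keys, so neither caller fills the cache twice.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Main {
  // Stand-ins for the two concrete parser singletons:
  static final String TEXT_PARSER = "DEFAULT_INT_PARSER";
  static final String TRIE_PARSER = "NUMERIC_UTILS_INT_PARSER";
  static final Map<List<Object>, int[]> cache = new HashMap<>();

  static int[] getInts(String field, String parser) {
    List<Object> key = Arrays.asList(field, parser);
    int[] cached = cache.get(key);
    if (cached != null) return cached;
    int[] result;
    if (parser == null) {
      // Autodetection: delegate to a concrete parser (which caches
      // the array under its own key)...
      result = getInts(field, TEXT_PARSER);
    } else {
      result = new int[] {1, 2, 3}; // pretend we filled this from the index
    }
    cache.put(key, result); // ...and also cache it under this key.
    return result;
  }

  public static void main(String[] args) {
    int[] auto = getInts("price", null);
    int[] explicit = getInts("price", TEXT_PARSER);
    System.out.println(auto == explicit); // same array, no 2X RAM
  }
}
```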
          Michael McCandless added a comment -

          The last patch looks great Uwe! I think we're nearly done here:

          • I like how FieldComparator and function/*FieldSource are now
            simplified to no longer have duplicated code for picking the
            default parser (they simply pass null instead). Default selection
            is now single-source.
          • Could you add the "<b>NOTE:</b> This API is experimental and might
            change in incompatible ways in the next release." caveat to the
            javadocs?
          • Can you change this:
            if (parser == null) try {
              return getLongs(reader, field, DEFAULT_LONG_PARSER);
            } catch (NumberFormatException ne) {
              return getLongs(reader, field, NUMERIC_UTILS_LONG_PARSER);      
            }
            

            to this:

            if (parser == null) {
              try {
                return getLongs(reader, field, DEFAULT_LONG_PARSER);
              } catch (NumberFormatException ne) {
                return getLongs(reader, field, NUMERIC_UTILS_LONG_PARSER);      
              }
            }
            

            ?

          • I think we can't actually deprecate NumberTools until we can call
            FieldCache.getShorts/getBytes on a NumericField? Ie, people
            relying on short/byte (to consume much less memory in FieldCache)
            cannot switch to numeric, and so must continue to zero-pad if they
            need to use RangeQuery/Filter?
          • You need a CHANGES entry.
          Uwe Schindler added a comment - edited

          Could you add the "<b>NOTE:</b> This API is experimental and might change in incompatible ways in the next release." caveat to the javadocs?

          For the whole Trie API or only NumericUtils? If the former, I would do this in a general javadoc commit after this issue. If only NumericField, I can do it now.

          Can you change this:

          Sure, we had this last time, too (I like my variant more, so I always automatically write it that way).

          I think we can't actually deprecate NumberTools until we can call FieldCache.getShorts/getBytes on a NumericField? Ie, people relying on short/byte (to consume much less memory in FieldCache) cannot switch to numeric, and so must continue to zero-pad if they need to use RangeQuery/Filter?

          I will open an issue because of this byte/short trie fields (LUCENE-1710)

          NumberTools does not handle zero-padding, so it could stay deprecated. Numbers encoded with NumberTools cannot be natively sorted on at all (only using StringIndex) and can only handle longs.

          You may mean NumberUtils from Solr in contrib/spatial, but this class is not yet released and is only used for spatial.

          You need a CHANGES entry.

          Will come.

          Michael McCandless added a comment -

          Could you add the "<b>NOTE:</b> This API is experimental and might change in incompatible ways in the next release." caveat to the javadocs?

          For the whole TrieAPI or only NumericUtils? If the first, I would do this with an general JavaDoc commit after this issue. If only NumericField, I could do it now.

          For the whole thing; I think an added NOTE at each class-level
          javadoc is what we need. A separate javadoc issue is good, but
          it needs to be done for 2.9.

          Can you change this:

          Sure, we had this the last time, too (I like my variant more, so I always automatically write it in that way)

          Which "last time"? Is there somewhere in the code base now where
          we do this?

          We generally try (though, don't always succeed) to follow Sun's coding
          guidelines (http://java.sun.com/docs/codeconv/) except 2-space indent
          not 4.

          I think we can't actually deprecate NumberTools until we can call FieldCache.getShorts/getBytes on a NumericField? Ie, people relying on short/byte (to consume much less memory in FieldCache) cannot switch to numeric, and so must continue to zero-pad if they need to use RangeQuery/Filter?

          NumberTools does not handle zero-padding, so it could stay deprecated. Numbers encoded with NumberTools cannot be natively sorted on at all (only using StringIndex) and can only handle longs.

          Actually, I believe it does do 0 padding and handles negative numbers
          correctly (NumberTools.longToString)?

          Ie, I can take a short now, call longToString, index with that, do
          [possibly inefficient] RangeQuery against it, and sort against it
          using only 2 bytes per doc, today?

          I will open an issue because of this byte/short trie fields (LUCENE-1710)

          OK, but since we've marked it 3.1 (which I think is OK; though in
          CHANGES let's document the limitation?), we should un-deprecate
          NumberTools now, and deprecate it again along with LUCENE-1710?

          Uwe Schindler added a comment -

          Sure, we had this the last time, too (I like my variant more, so I always automatically write it in that way)

          Which "last time"? Is there somewhere in the code base now where
          we do this?

          We generally try (though, don't always succeed) to follow Sun's coding
          guidelines (http://java.sun.com/docs/codeconv/) except 2-space indent
          not 4.

          This was not against the change. With "last time" I meant that some time ago you mentioned the same thing in a different patch of mine. I will change it.

          My note was only that I "automatically" write such code because I myself find it more readable. That was all.

          NumberTools does not handle zero-padding, so it could stay deprecated. Numbers encoded with NumberTools cannot be natively sorted on at all (only using StringIndex) and can only handle longs.

          Actually, I believe it does do 0 padding and handles negative numbers
          correctly (NumberTools.longToString)?

          Ie, I can take a short now, call longToString, index with that, do
          [possibly inefficient] RangeQuery against it, and sort against it
          using only 2 bytes per doc, today?

          You cannot do this with NumberTools. NumberTools uses a special radix-36 encoding (not radix 10 like normal numbers). Like NumericUtils' encoding it is not human-readable, so it cannot be parsed with the standard number parsing methods; to convert back, you need the method from the same class.

          Because of this you have two possibilities: write your own parser and pass it to SortField/FieldCache, or sort using StringIndex (the encoding is sortable according to String.compareTo).

          So it can be deprecated.

          If somebody wants to do encoding and parsing with FieldCache.getShorts(), there is no way around a DecimalFormat with zero-padding, plus the problem with negative numbers.
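A minimal illustration (plain Java, not Lucene code) of the zero-padding problem: zero-padded decimal strings sort correctly for non-negative numbers only, while for negatives the lexicographic order is the reverse of the numeric order.

```java
import java.text.DecimalFormat;

public class PadDemo {
  static final DecimalFormat FMT = new DecimalFormat("00000");

  public static String pad(int value) {
    return FMT.format(value);
  }

  public static void main(String[] args) {
    // Positive values sort fine: "00002" < "00013"
    System.out.println(pad(2).compareTo(pad(13)) < 0);
    // Negative values sort backwards: "-00003" < "-00012" lexicographically,
    // although -3 > -12 numerically.
    System.out.println(pad(-3).compareTo(pad(-12)) < 0);
  }
}
```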

          Michael McCandless added a comment -

          You cannot do this with NumberTools.

          Whoa! Sorry, you are right. Nothing in FieldCache can handle decoding
          this encoding into short/byte; so you'd need something outside of
          Lucene's core, today, to do that.

          Though it does allow you to do RangeQuery with short/byte, and you
          could sort as SortField.STRING (quite hideously memory inefficient).

          So I now agree: we should deprecate NumberTools entirely, now.

          If people ask how to handle short/byte before we resolve LUCENE-1710,
          the answer is to upgrade everything to int. The resulting wasteful
          int[] that'd be in the FieldCache is not nearly as wasteful as the
          String[] you'd need to use to do sorting, today. And you could always
          make a custom parser using NumericUtils that downcasts to byte/short,
          anyway, since the needed APIs are public.
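The custom-parser idea above rests on the sortable encoding NumericUtils uses. A sketch of the core trick (not the actual Lucene implementation): flipping the sign bit makes the unsigned (lexicographic) order of the bits agree with the signed numeric order, so a custom parser can simply decode the int and downcast it.

```java
public class SortableDemo {
  // Flip the sign bit so unsigned comparison matches signed numeric order.
  public static int sortableBits(int value) {
    return value ^ 0x80000000;
  }

  // The downcast a custom parser would do; the caller must know the
  // field only ever held values in short range.
  public static short downcast(int decoded) {
    return (short) decoded;
  }

  public static void main(String[] args) {
    int a = sortableBits(-5), b = sortableBits(3);
    // Unsigned comparison of the encoded bits matches signed order:
    System.out.println(Integer.compareUnsigned(a, b) < 0); // -5 < 3
    System.out.println(downcast(1234));
  }
}
```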

          Uwe Schindler added a comment -

          Attached is a patch with the latest changes:

          • Experimental-warning to all Numeric* classes
          • Formatting fixes
          • CHANGES.txt
          Michael McCandless added a comment -

          Uwe, can we give a default (4?) for precisionStep, when creating a NumericField, NumericRangeFilter/Query?

          Uwe Schindler added a comment -

          Uwe, can we give a default (4?) for precisionStep, when creating a NumericField, NumericRangeFilter/Query? If we find a good default, it can be applied more simply in one issue (just some additional ctors and factories).

          Can you open an issue? There are some problems with defining a good default. In my environment, 8 gives the best results; 4 is only a little faster.
          The problems are described in the JavaDocs: a smaller precisionStep means more different precisions:

          • more seek operations and new TermEnums
          • but fewer terms

          For my index, 8 keeps these in good balance; with 4, seeking gets costly, and with 2 I see no difference, only a much larger index (600,000 docs on PANGAEA; a 4 million doc index locally on my laptop with only trie numbers).

          So I do not want to set a default without enough tests from different people/scenarios; these will come when 2.9 is out and everybody tries it out.

          I will commit this patch in a day or two after applying LUCENE-1701-test-tag-special.patch to 2.4-test-branch.
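The tradeoff can be put in numbers. Assuming the figures cited in this thread (16 precisions and up to 31 term enums for longs with step 4), the number of precision levels is ceil(bits / precisionStep), and the worst-case number of term enums per range works out to 2 * levels - 1; this is an illustrative back-of-the-envelope, not Lucene code.

```java
public class PrecisionMath {
  // Number of distinct precisions indexed per value: ceil(valueBits / step).
  public static int levels(int valueBits, int precisionStep) {
    return (valueBits + precisionStep - 1) / precisionStep;
  }

  // Worst-case term enums (precision changes) per range, per the
  // "16 precisions / up to 31 term enums" figure cited in this thread.
  public static int maxTermEnums(int valueBits, int precisionStep) {
    return 2 * levels(valueBits, precisionStep) - 1;
  }

  public static void main(String[] args) {
    System.out.println(levels(64, 4));       // long, step 4 -> 16 precisions
    System.out.println(maxTermEnums(64, 4)); // up to 31 term enums
    System.out.println(levels(64, 8));       // long, step 8 -> 8 precisions
    System.out.println(levels(32, 4));       // int, step 4 -> 8 precisions
  }
}
```

Smaller steps multiply the precision levels (and seeks) while reducing the terms visited per level, which is why the sweet spot differs between 32-bit and 64-bit values.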

          Earwin Burrfoot added a comment - - edited

          Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8.

          Though, if you want really fast dates, choosing hour/day/month/year as precision steps is vastly superior, plus it also clicks well with user-selected ranges. Still, I dumped this approach for uniformity and clarity.

          Uwe Schindler added a comment - - edited

          Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8.

          I think 4 for ints is a good start, better than 4 for longs (which produces 16 different precision terms and up to 31 term enums [= precision changes] per range). 6 is a good idea; it brings a little bit more than 8 but does not produce too many precision changes. I tested that also with my 2 M numeric-only index here.

          Mike: As you see, the precision step is a good config approach, so a default should be chosen carefully.
          It may even be different for the same data type: e.g. you have longs, but all longs in your index are only in a very limited range – OK, you could use an int, too. But if you index dates as longs and your dates span only two years or so, 4 may still be good, because on a smaller range the algorithm does not need to go down to the lowest precision.

          Though, if you want really fast dates, choosing hour/day/month/year as precision steps is vastly superior, plus it also clicks well with user-selected ranges. Still, I dumped this approach for uniformity and clarity.

          That is clear, because in the case of dates these precisions fit users' queries exactly (users often select full days as the range).

          Nice to hear that you use TrieRange! What is your index spec and measured query speeds (if it does not go too far into company internals)?

          Earwin Burrfoot added a comment -

          >>> Design for today.
          >> And spend two years deprecating and supporting today's designs after you get a better thing tomorrow. Back-compat Lucene-style and agile design aren't something that marries well.
          >> donating something to Lucene means casting it in concrete.
          > We can't let fear of back-compat prevent us from making progress.
          My point was that strict back-compat prevents people from donating work which is not yet finalized. They either lose comfortable volatility of private code, or have to maintain two versions of it - private and Lucene.

          >> NRT seems to tread the same path, and I'm not sure it's going to win that much turnaround time after newly-introduced per-segment collection.
          > I agree, per-segment collection was the bulk of the gains needed for
          > NRT. This was a big change and a huge step forward in simple reopen
          > turnaround.
          I vote it the most frustrating (in terms of adapting your custom code) and the most useful change of 2.9

          > But, not having to write & read deletes to disk, not commit (fsync)
          > from writer in order to see those changes in reader should also give
          > us decent gains. fsync is surprisingly and intermittently costly.
          I'm not sure this can't be achieved without messing with IR/W guts so much. The guys from LinkedIn driving this feature (if I'm not mistaken) had a prior solution with separate indexes, one on disk, one in RAM. Per-segment collection adds superfast reopens and a MultiReader that is way greater than MultiSearcher – you can finally do adequately fast searches across separate indexes. Do we still need to add complexity for minor performance gains?

          > And this integration lets us take it a step further with LUCENE-1313,
          > where recently created segments can remain in RAM and be shared with
          > the reader.
          RAMDirectory?

          >> Some time ago I finished a first version of IR plugins, and enjoy pretty low reopen times (field/facet/filter cache warmups included). (Yes, I'm going to open an issue for plugins once they stabilize enough)
          > I'm confused: I thought that effort was to make SegmentReader's
          > components fully pluggable? (Not to actually change what components
          > SegmentReader is creating). EG does this modularization alter the
          > approach to NRT? I thought they were orthogonal.
          Yes, they are orthogonal. This was yet another praise of per-segment collection and an example of how this approach can be extended to your custom stuff (like filtercache).

          Michael McCandless added a comment -

          Using 4 for int, 6 for long.

          Unfortunately we can't easily conditionalize the default by int vs long, i.e., you use NumericField like this:

          NumericField f = new NumericField("price", 4);
          f.setFloatValue(15.50f);
          

          Mike: As you see, the precision step is a good config approach, so a default should be chosen carefully.

          Agreed! But it need not be "perfect". Advanced users can test & iterate to find the best tradeoff for their particular field's value distribution. For ranges that are slow today with RangeQuery (because of many unique terms), NumericRangeQuery will be a massive speedup with e.g. a default of 4.

          New users shouldn't have to understand what precisionStep means, or anything about "what's under the hood", in order to use NumericField. I should be able to simply:

          new NumericField("price", 15.50);
          

          Erring towards more terms (and faster searches) is fine, I think, because in a "typical" index the text fields will dwarf any small added disk space (hence my proposal of 4 as the default precisionStep).

          Can you open an issue?

          OK I'll open a new issue.

          Michael McCandless added a comment -

          OK I opened spinoff issues LUCENE-1712 (default for precisionStep) and LUCENE-1713 (rename RangeQuery -> TextRangeQuery, or something).

          Michael McCandless added a comment -

          My point was that strict back-compat prevents people from donating work which is not yet finalized. They either lose comfortable volatility of private code, or have to maintain two versions of it - private and Lucene.

          That's a good point, though if it's contrib and you're a contrib committer it lessens the challenge, but the challenge is still there....

          I vote it for the most frustrating (in terms of adopting your custom code) and most useful change of 2.9

          No pain no gain?

          Do we still need to add complexity for minor performance gains?

          The problem is I've seen fsync take a ridiculous amount of time; it's not very predictable. So I think we do need some way to not put fsync between the changes to the index and the ability to search those changes.

          > And this integration lets us take it a step further with LUCENE-1313,
          > where recently created segments can remain in RAM and be shared with
          > the reader.

          RAMDirectory?

          Exactly; that's what LUCENE-1313 is doing (flush new segments to a RAMDir).

          This was yet another praise to per-segment collection and an example of how this approach can be extended on your custom stuff (like filtercache).

          OK.

          Uwe Schindler added a comment -

          Final patch.

          Uwe Schindler added a comment -

          Committed backwards tests: revision 787714
          Committed patch: revision 787723

          Thanks Mike!


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler