Lucene - Core
LUCENE-1582

Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      TrieRange currently has the following problems:

      • To add a field that uses trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed field configuration.
      • TrieUtils by default creates a helper field containing the lower-precision terms to enable sorting (a limitation of one term per document for sorting).
      • trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on the GC if you index a lot of numeric values. A lot of char[]-to-String copying is also involved.

      This issue should improve this:

      • trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused through the Token API, and no additional String[] arrays are created for the encoded result; instead, the TokenStream enumerates the trie values.
      • Trie fields can be added to Documents during indexing using the standard API: new Field(name, TokenStream, ...), so no extra util method is needed. By using token filters, one could also add payloads and so on and customize everything.

      The drawback is: sorting would no longer work. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms as soon as a lower-precision one is enumerated by TermEnum. I will create a "hack" patch for TrieUtils use only that uses an unchecked Exception in the Parser to stop iteration. With LUCENE-831, a more generic API can be used for this (a custom parser/iterator implementation for FieldCache). I will attach the field cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it.
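      A minimal sketch of that hack (assuming Lucene's FieldCache.IntParser interface and the FieldCache.StopFillCacheException added by this patch; the getShift() and prefixCodedToInt() helper names are assumptions for illustration):

      // Sketch only: a parser that decodes prefix-coded ints and aborts
      // cache filling at the first lower-precision term from the TermEnum.
      FieldCache.IntParser parser = new FieldCache.IntParser() {
        public int parseInt(String term) {
          if (TrieUtils.getShift(term) != 0) // assumed helper: term has lower precision
            throw new FieldCache.StopFillCacheException(); // unchecked, stops filling
          return TrieUtils.prefixCodedToInt(term); // decode the full-precision term
        }
      };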

      1. LUCENE-1582.patch
        69 kB
        Uwe Schindler
      2. LUCENE-1582.patch
        78 kB
        Uwe Schindler
      3. ASF.LICENSE.NOT.GRANTED--LUCENE-1582.patch
        37 kB
        Uwe Schindler
      4. ASF.LICENSE.NOT.GRANTED--LUCENE-1582.patch
        67 kB
        Uwe Schindler


          Activity

          Michael McCandless added a comment -

          This sounds like a great improvement!

          Shalin Shekhar Mangar added a comment -

          trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused through the Token API, and no additional String[] arrays are created for the encoded result; instead, the TokenStream enumerates the trie values.

          +1

          Uwe Schindler added a comment -

          A first version of the patch:

           • JavaDocs not finished yet (examples, documentation)
          • New classes: IntTrieTokenStream, LongTrieTokenStream
          • Removed TrieUtils.trieCodeInt/Long()
          • Removed TrieUtils.addIndexFields()
           • Removed all fields[] arrays; now only one field name is supported everywhere

          To index a trie-encoded field, just use (preferred way):

           Field f = new Field(name, new LongTrieTokenStream(value, precisionStep));
          f.setOmitNorms(true);
          f.setOmitTermFreqAndPositions(true);
          

           (Maybe TrieUtils could supply a shortcut helper method that uses these special optimal settings when creating the field, e.g. TrieUtils.newLongTrieField(); a sketch follows below.) This is extensible with TokenFilters, if somebody wants to add payloads and so on.
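           Such a shortcut could look roughly like this (hypothetical method, not part of the patch; the Field constructor and setters are the ones used above):

           // Hypothetical convenience method: creates a trie-encoded field with
           // the optimal settings (no norms, no term frequencies/positions).
           public static Field newLongTrieField(String name, long value, int precisionStep) {
             Field f = new Field(name, new LongTrieTokenStream(value, precisionStep));
             f.setOmitNorms(true);
             f.setOmitTermFreqAndPositions(true);
             return f;
           }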

           This patch also contains the sorting fixes in the core: FieldCache.StopFillCacheException can be thrown from within the parser. Maybe this should be provided as a separate sub-issue (or top-level issue), because I cannot apply patches to core. Mike, can you do this when we commit this?

          Yonik: It would be nice to hear some comments from you, too.

           I really like the new way to create trie-encoded fields. When this moves to core, the tokenizers can be renamed to IntTokenStream; TrieUtils would then only contain the converters to/from doubles, the encoding, and the range split.

           About the GC note in the description of this issue: the new API does not use as many array allocations and copies, and it reuses the Token. But as a TokenStream instance must be created for every numeric value, the GC cost is about the same for the new and old API, especially because each TokenStream internally creates a LinkedHashMap for the attributes.

           Just a question for the indexer people: Is it possible to add two fields with the same field name to a document, both with a TokenStream? This is needed to add more than one trie-encoded value (which worked with the old API). I just want to be sure.

          Michael McCandless added a comment -

           Maybe this should be provided as a separate sub-issue (or top-level issue), because I cannot apply patches to core. Mike, can you do this when we commit this?

          It's fine to include these changes in this patch – I can commit them all at once.

           But as a TokenStream instance must be created for every numeric value, the GC cost is about the same for the new and old API, especially because each TokenStream internally creates a LinkedHashMap for the attributes.

           Hmm, we should do some perf tests to see how big a deal this turns out to be. It'd be nice to get some sort of reuse API working if performance is really hurt. (E.g. Analyzers can provide reusableTokenStream, keyed by thread.) You'd presumably have to key on thread & field name. If you do this, then probably a shortcut helper method should be the preferred way.

          Just a question for the indexer people: Is it possible to add two fields with the same field name to a document, both with a TokenStream?

          Each with a different TokenStream instance, right? Yes, this should be fine; the tokens are "logically" concatenated just like multi-valued String fields.
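           For example, a sketch of adding two values under the same field name (assuming an open IndexWriter named writer; the field name and values are made up):

           Document doc = new Document();
           // one TokenStream instance per Field; the tokens are logically concatenated
           doc.add(new Field("price", new LongTrieTokenStream(1299L, precisionStep)));
           doc.add(new Field("price", new LongTrieTokenStream(4999L, precisionStep)));
           writer.addDocument(doc);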

          Uwe Schindler added a comment -

           Hmm, we should do some perf tests to see how big a deal this turns out to be. It'd be nice to get some sort of reuse API working if performance is really hurt. (E.g. Analyzers can provide reusableTokenStream, keyed by thread.) You'd presumably have to key on thread & field name. If you do this, then probably a shortcut helper method should be the preferred way.

           We can also leave this to the implementor: if somebody indexes thousands of documents, they could reuse one instance of the TokenStream per field. As the instance is only read on document addition, a separate instance is needed for each field, but it can be reused for the next document. This is the same as reusing Field instances during indexing.

           I can add a setValue() method to the TokenStream that resets it with a new value. One could then use a single instance and call setValue() to supply a new value for each document. The precisionStep should not be modifiable.
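           A sketch of the resulting reuse pattern (assuming the proposed setValue() method; writer and values are made up):

           // Create the stream and field once, then reuse them for every document;
           // only the value changes, the precisionStep stays fixed.
           LongTrieTokenStream stream = new LongTrieTokenStream(0L, precisionStep);
           Field field = new Field("timestamp", stream);
           for (int i = 0; i < values.length; i++) {
             stream.setValue(values[i]); // reset the stream with the next value
             Document doc = new Document();
             doc.add(field);
             writer.addDocument(doc);
           }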

          Just a question for the indexer people: Is it possible to add two fields with the same field name to a document, both with a TokenStream?

          Each with a different TokenStream instance, right? Yes, this should be fine; the tokens are "logically" concatenated just like multi-valued String fields.

          Yes, sure

          Michael McCandless added a comment -

           I can add a setValue() method to the TokenStream that resets it with a new value.

           That's a good step forward, but it'd likely mean the default is slower performance? In general I prefer (when realistic) the default out-of-the-box experience to be good performance, but in this case there doesn't seem to be an easy way to get a natural high-performance default. And e.g. we don't reuse Document & Field by default, so expecting someone to do a bit of work to reuse Trie's TokenStreams seems OK.

           It's almost like... Analyzer.reusableTokenStream(...) should "know" it's dealing with a numeric field, and handle the reuse for you, in a future world where Lucene knows that a Field is a NumericField meant to be indexed using trie. But we can leave all of that for future optimization; for now, providing setValue is great.

          Uwe Schindler added a comment -

          Updated patch:

           • supports setValue() to reset the TokenStream with a new value for reuse (as discussed before)
           • completed JavaDocs
           • removed dead code parts
           • small change in the RangeBuilder API (removed unneeded parameters)

           The difference between reusing fields and token streams versus always creating new ones is measurable (I compared this in the test case), but not significant. The JavaDocs contain info on how to reuse.

           I have done everything I planned; now it's time to discuss the change.

          Uwe Schindler added a comment -

           Some updates in the patch, mostly typos, unneeded imports, etc.
           One change: prefixCodedTo...() now accepts CharSequence instead of String (because only this interface's methods are needed for decoding).

          Uwe Schindler added a comment -

          New patch. In my opinion, it is now stable.

          New features/changes:

          • Attribute "ShiftAttribute" for the new TokenStream API. This makes it possible t write consumers of the TokenStream that maybe index the values to different fields depending on the shift value. This only works with the new API (as the old Token does not have a field for that).
          • Tests for the TokenStreams
          • Missing initialization of Token in old TokenStream API
          • reverted CharSequence for prefix decoder to String (performance was 5% worse during FieldCache filling)
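
           A consumer of the new attribute could look roughly like this (sketch; a ShiftAttribute with a getShift() accessor is assumed from the bullet above, TermAttribute is the term attribute of the new API):

           TokenStream stream = new LongTrieTokenStream(value, precisionStep);
           ShiftAttribute shiftAtt = (ShiftAttribute) stream.addAttribute(ShiftAttribute.class);
           TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
           while (stream.incrementToken()) {
             // route full-precision terms and lower-precision helper terms differently
             String targetField = (shiftAtt.getShift() == 0) ? "exact" : "helper";
             // ... index termAtt.term() into targetField
           }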

           I think it is ready for commit. I did some further performance tests with an index of 10 million indexed trie values:

           • The speed difference between reusing the token streams and always creating new ones is marginal, at most a 10% improvement
           • Filling the FieldCache is really fast now; the use of CharSequence was a bad idea (nicer API-wise, but not for performance: the well-known Java interface problem)

           I did some statistics on this large index: the average number of terms for RangeFilters is 450 for an 8-bit precision step and 70 for 4 bits. This is exactly what I have seen with 10,000 docs in the test cases and 500,000 docs in our PANGAEA index. This verifies that the number of terms is not related to index size, only to the precision step.

           I will do some further speed tests comparing the prefix-encoded FieldCache with the conventional int cache using Integer.parseInt(). I suspect a big improvement, because of the simple encoding.

          I will also compare the indexing time with the old API and the new tokenizers.

           Mike: if you think the changes in FieldCache are OK, can you commit just those FieldCache changes?

          Michael McCandless added a comment -

           OK, the changes to FieldCache look OK – I'll commit shortly. I'll tone back the javadoc to an Expert/non-back-compat warning. It doesn't matter much, since with LUCENE-831 we should be able to remove it entirely before releasing 2.9.

          Michael McCandless added a comment -

          OK I committed the FieldCache part... thanks Uwe!

          Uwe Schindler added a comment -

           Thanks, I will then go forward with this.
          Finally: Let's go on with 831...

          Uwe Schindler added a comment -

          Committed Revision: 762710
          I only added term number statistics in the filter tests.

          Michael McCandless added a comment -

           Finally: Let's go on with 831...

           Hear, hear!


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
             • Votes:
               1
               Watchers:
               0
