Lucene - Core
LUCENE-7027

NumericTermAttribute throws IAE after NumericTokenStream is exhausted

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 5.5, 6.0
    • Fix Version/s: 5.5, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields:
      New

      Description

      This small test:

        public void testCloneFullPrecisionToken() throws Exception {
          FieldType fieldType = new FieldType(IntField.TYPE_NOT_STORED);
          fieldType.setNumericPrecisionStep(Integer.MAX_VALUE);
          Field field = new IntField("field", 17, fieldType);
          TokenStream tokenStream = new CachingTokenFilter(field.tokenStream(null, null));
          assertTrue(tokenStream.incrementToken());
        }
      

      hits this unexpected exception:

      There was 1 failure:
      1) testCloneFullPrecisionToken(org.apache.lucene.analysis.TestNumericTokenStream)
      java.lang.IllegalArgumentException: Illegal shift value, must be 0..31; got shift=2147483647
      	at __randomizedtesting.SeedInfo.seed([2E1E93EF810CB5F7:EF1304A849574BC7]:0)
      	at org.apache.lucene.util.NumericUtils.intToPrefixCodedBytes(NumericUtils.java:175)
      	at org.apache.lucene.util.NumericUtils.intToPrefixCoded(NumericUtils.java:133)
      	at org.apache.lucene.analysis.NumericTokenStream$NumericTermAttributeImpl.getBytesRef(NumericTokenStream.java:165)
      	at org.apache.lucene.analysis.NumericTokenStream$NumericTermAttributeImpl.clone(NumericTokenStream.java:217)
      	at org.apache.lucene.analysis.NumericTokenStream$NumericTermAttributeImpl.clone(NumericTokenStream.java:148)
      	at org.apache.lucene.util.AttributeSource$State.clone(AttributeSource.java:55)
      	at org.apache.lucene.util.AttributeSource.captureState(AttributeSource.java:288)
      	at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:96)
      	at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:70)
      	at org.apache.lucene.analysis.TestNumericTokenStream.testCloneFullPrecisionToken(TestNumericTokenStream.java:138)
      

      because CachingTokenFilter expects that it can captureState after calling end(), but NumericTokenStream gets angry about this.
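The contract that CachingTokenFilter relies on can be sketched without Lucene. In this hedged sketch, `TokenSource` and the String-based "state" are illustrative stand-ins, not Lucene's actual APIs: the consumer captures state for every token, then calls end() and captures state once more, so attribute implementations must remain clonable after the stream is exhausted.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the consumption pattern described above. TokenSource and
// the String "state" are illustrative stand-ins, not Lucene's actual APIs.
final class CachingConsumerSketch {
    interface TokenSource {
        boolean incrementToken(); // false once the stream is exhausted
        void end();               // sets the "final" offset/posInc state
        String captureState();    // stand-in for AttributeSource.captureState()
    }

    // Mirrors what CachingTokenFilter.fillCache() does: capture each token's
    // state, then capture once more AFTER end() - the call that hits the bug.
    static List<String> fillCache(TokenSource in) {
        List<String> cache = new ArrayList<>();
        while (in.incrementToken()) {
            cache.add(in.captureState());
        }
        in.end();
        cache.add(in.captureState()); // must not throw on an exhausted stream
        return cache;
    }
}
```

The last captureState() call is exactly where NumericTokenStream's attribute threw the IllegalArgumentException.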

      1. LUCENE-7027-master.patch
        5 kB
        Uwe Schindler
      2. LUCENE-7027-master.patch
        4 kB
        Uwe Schindler
      3. LUCENE-7027-master.patch
        2 kB
        Uwe Schindler

        Issue Links

          Activity

          Michael McCandless added a comment -

          This is the Elasticsearch issue where we are hitting this: https://github.com/elastic/elasticsearch/pull/16643

          Thanks to Lee Hinman for uncovering this!

          Uwe Schindler added a comment -

          CachingTokenFilter is a cool test. I can try to fix this. NumericTokenStream was originally intended as an implementation detail, only called from DocumentsWriter and never wrapped by any filters.

          I am sure I can fix this.

          Uwe Schindler added a comment -

          The issue here is: after the token stream is exhausted (which is the case once you call end()), the attribute cannot be cloned anymore, because the call to getBytesRef() fails with IllegalArgumentException (which is a bug). We should work around that and just return an empty BytesRef or the like.
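The workaround described above can be sketched in isolation. This is a hedged sketch with assumed names (`NumericTermSketch`, the simplified prefix coding), not the actual Lucene patch: once the shift is outside the valid 0..31 range, return an empty term instead of throwing.

```java
// Hedged sketch of the fix idea, not the actual Lucene patch: once shift
// is outside the valid 0..31 range (e.g. after the stream is exhausted),
// return an empty term instead of throwing IllegalArgumentException.
final class NumericTermSketch {
    static final int VALUE_SIZE = 32;  // bits in an int
    int shift = Integer.MAX_VALUE;     // "exhausted" state, as in the bug report
    final int value = 17;

    byte[] getBytesRef() {
        if (shift < 0 || shift >= VALUE_SIZE) {
            return new byte[0];        // exhausted: empty term, no exception
        }
        // Simplified prefix coding: flip the sign bit for sortability and
        // drop the lowest `shift` bits (Lucene's real encoding is 7-bit packed).
        int sortable = (value ^ 0x80000000) >>> shift;
        return new byte[] {
            (byte) (sortable >>> 24), (byte) (sortable >>> 16),
            (byte) (sortable >>> 8),  (byte) sortable
        };
    }
}
```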

          Uwe Schindler added a comment -

          Alternative test that fails:

            public void testCaptureStateAfterExhausted() throws Exception {
              // ivalue is the test's int value, defined elsewhere in the suite
              try (LegacyNumericTokenStream stream = new LegacyNumericTokenStream().setIntValue(ivalue)) {
                stream.reset();
                while (stream.incrementToken());
                stream.captureState();
                stream.end();
                stream.captureState();
              }
            }
          

          (this one does not use CachingTokenFilter, which is not part of Lucene core).

          Uwe Schindler added a comment -

          Patch for master. I am not sure if this affects performance, because we have the extra check in getBytesRef().

          Uwe Schindler added a comment - edited

          The cause is known, and the issue reported here is just broken usage of NumericTokenStream in user code that looks totally broken anyway. I tend not to fix it and would simply close this bug as Won't Fix.

          NumericTokenStream shouldn't have been public from the beginning... I have no idea why Elasticsearch uses the class at all! NRQ's field is just implemented as a TokenStream to allow DocumentsWriter to consume it, but it should have been hidden behind NumericField from the beginning. Solr does not use it anymore - we removed it a while ago.

          In addition, NumericTokenStream was never written to be wrapped by TokenFilters.

          Michael McCandless added a comment -

          The cause is known and the issue reported here is just some broken usage of NumericTokenStream in user code that looks totally broken anyways,

          Hmm, I don't think this usage is broken? The public TokenStream API has a contract, and the public NumericTokenStream fails to implement it properly.

          Or maybe you are saying one can never call captureState after calling end? But then how does one hang onto the "final" offset and posInc? And if so, I guess we should fix MockTokenizer to enforce this? Yet, CachingTokenFilter is doing exactly this, so maybe the bug is there?

          Our analysis APIs are way too complex! They are like a 747 cockpit. We can't even agree on which buttons you are allowed to press, when.

          I have no idea why Elasticsearch uses the class at all!

          Well, Elasticsearch's simple query parser is just trying to create the right "equals" query from the incoming text, for a numeric field. Here's the query:

          {"simple_query_string":{"query":"123","fields":["foo","bar"]}}
          

          It does this today by using the appropriate tokenizer given the field type, and for numerics that's supposed to be NumericTokenStream. But then we hit this bug, seen in ES originally at https://github.com/elastic/elasticsearch/issues/16577

          Yes, ES can work around this bug if we leave Lucene buggy, which is exactly what that PR is doing, by "special casing" numerics by explicitly creating a TermQuery using the full precision numeric term, instead of trusting its type-specific tokenizer to work correctly.

          Because of the corner case and possible performance corner cases, I'd tend to close this as won't fix.

          Hmm I think correctness trumps performance?

          Uwe Schindler added a comment -

          I have no problem committing the given patch. It allows getting an empty token attribute after the token stream is exhausted. It is the only correct way to fix it.

          I just mentioned this here to think about it (if we really need this). But I agree with you:

          Hmm I think correctness trumps performance?

          Or maybe you are saying one can never call captureState after calling end? But then how does one hang onto the "final" offset and posInc? And if so, I guess we should fix MockTokenizer to enforce this? Yet, CachingTokenFilter is doing exactly this, so maybe the bug is there?

          That must be allowed, otherwise the whole caching approach won't work. In addition, it is not only captureState that's broken; everything done with the TermAttribute leads to the IllegalArgumentException. So it is not the token stream that's buggy, just the attribute implementation (and this is what the patch fixes).

          Well, Elasticsearch's simple query parser is just trying to create the right "equals" query from the incoming text, for a numeric field.

          OK, I was not aware of that. The problem is that the query parser uses the CachingTokenFilter, I assume. Right?

          Uwe Schindler added a comment - edited

          OK, I will commit this to "master". Should I backport to 5.5? I can backport to branch_5x, but that's useless if we don't put it into 5.5. I will also add another test to verify that restoreState works.

          Uwe Schindler added a comment -

          New patch with more tests (especially cloning)

          Uwe Schindler added a comment -

          Improved the test to use a random value throughout the whole suite (unrelated to this issue, but still a good thing to do).

          Michael McCandless added a comment -

          Thanks Uwe Schindler! Patch looks good to me!

          The problem is that the query parser uses the CachingTokenFilter, I assume. right?

          Yeah ... it's SimpleQueryParser.java in ES, and it seems to be poached maybe from AnalyzerQueryNodeProcessor.java in Lucene's flexible query parser, where it needs to make two passes over the tokens the tokenizer created from the text, I think?

          Should I backport to 5.5?

          +1, I'll mark this blocker for 5.5, and cancel the current RC1 and respin.

          Thanks Uwe Schindler!

          ASF subversion and git services added a comment -

          Commit 42ae21cb9ae36dde4d5ffda16e06976b369b95e4 in lucene-solr's branch refs/heads/master from Uwe Schindler
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=42ae21c ]

          LUCENE-7027: Fixed NumericTermAttribute to not throw IllegalArgumentException after NumericTokenStream was exhausted

          ASF subversion and git services added a comment -

          Commit 7abc0c6de6ae51f0b3b805c8d9d83185e78bff69 in lucene-solr's branch refs/heads/branch_5x from Uwe Schindler
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=7abc0c6 ]

          LUCENE-7027: Fixed NumericTermAttribute to not throw IllegalArgumentException after NumericTokenStream was exhausted

          Conflicts:
          	lucene/core/src/java/org/apache/lucene/analysis/NumericTokenStream.java
          	lucene/core/src/test/org/apache/lucene/analysis/TestNumericTokenStream.java
          ASF subversion and git services added a comment -

          Commit d83f57901f38334c103678b985cf875a9ef51013 in lucene-solr's branch refs/heads/branch_5_5 from Uwe Schindler
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d83f579 ]

          LUCENE-7027: Fixed NumericTermAttribute to not throw IllegalArgumentException after NumericTokenStream was exhausted

          Conflicts:
          	lucene/core/src/java/org/apache/lucene/analysis/NumericTokenStream.java
          	lucene/core/src/test/org/apache/lucene/analysis/TestNumericTokenStream.java
          Uwe Schindler added a comment -

          Thanks Mike & Lee!

          ASF subversion and git services added a comment -

          Commit 40f15765ba7605d1b5f3897fc3f97f2848635d77 in lucene-solr's branch refs/heads/branch_5_5 from Uwe Schindler
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=40f1576 ]

          LUCENE-7027: Fix accidental deprecation of test

          ASF subversion and git services added a comment -

          Commit aaa05e20e789afca2ebc8443b248d512318cd201 in lucene-solr's branch refs/heads/branch_5x from Uwe Schindler
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=aaa05e2 ]

          LUCENE-7027: Fix accidental deprecation of test

          Michael McCandless added a comment -

          Thanks Uwe Schindler!

          Lee Hinman added a comment -

          Thanks Uwe Schindler!


            People

            • Assignee: Uwe Schindler
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 3
