Lucene - Core
LUCENE-4656

Fix IndexWriter to work with EmptyTokenizer and EmptyTokenStream (which have no CharTermAttribute); fix BaseTokenStreamTestCase

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      TestRandomChains can fail because EmptyTokenizer doesn't have a CharTermAttribute and doesn't compute the end offset (if the offset attribute was added by a filter).

      Attachments

      1. LUCENE-4656_bttc.patch
        3 kB
        Robert Muir
      2. LUCENE-4656.patch
        21 kB
        Uwe Schindler
      3. LUCENE-4656.patch
        13 kB
        Uwe Schindler
      4. LUCENE-4656.patch
        13 kB
        Uwe Schindler
      5. LUCENE-4656.patch
        13 kB
        Uwe Schindler
      6. LUCENE-4656.patch
        5 kB
        Adrien Grand
      7. LUCENE-4656.patch
        2 kB
        Adrien Grand
      8. LUCENE-4656-IW-bug.patch
        2 kB
        Uwe Schindler
      9. LUCENE-4656-IW-fix.patch
        9 kB
        Uwe Schindler
      10. LUCENE-4656-IW-fix.patch
        9 kB
        Uwe Schindler


          Activity

          Robert Muir added a comment -

          I think this might be a broken assert in BaseTokenStreamTestCase?

          assertTrue("has no CharTermAttribute", ts.hasAttribute(CharTermAttribute.class));

          Why do we have that? This assert can be safely removed, I think.

          Adrien Grand added a comment -

          Patch. I wasn't sure whether to add a CharTermAttribute to EmptyTokenizer or to try fixing BaseTokenStreamTestCase, but I couldn't think of a non-trivial tokenizer that wouldn't have a CharTermAttribute, so I left the assertion that checks that a token stream always has one.

          Adrien Grand added a comment -

          Why do we have that?

          It feels strange to me that a non-trivial TokenStream could have no CharTermAttribute?

          Robert Muir added a comment -

          I think we should fix BaseTokenStreamTestCase, but your patch fixes other things about this Tokenizer?

          For example, it fixes its end() to work and makes it consume from its reader.

          Robert Muir added a comment -

          It feels strange to me that a non-trivial TokenStream could have no CharTermAttribute?

          I don't think analysis components should add attributes they don't really need: they should add only the minimal set they need to work. So I think that extends to this empty one?
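
          For illustration, the minimal shape of such an empty stream declares no attributes at all. This is a sketch of the idea (close to what analysis-common's EmptyTokenStream looks like, but not copied from any patch on this issue):

            import org.apache.lucene.analysis.TokenStream;

            /** A TokenStream that never produces a token and therefore needs no attributes. */
            public final class MinimalEmptyTokenStream extends TokenStream {
              @Override
              public boolean incrementToken() {
                return false; // no tokens, ever; no CharTermAttribute required
              }
            }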

          Uwe Schindler added a comment -

          I am happy with both solutions. I agree that only the attributes actually used should be added. There are other token streams that don't have a CTA, e.g. NumericTokenStream. For the indexer, only TermToBytesRefAttribute is mandatory (I am not sure it is really mandatory, but it is the only one that is consumed to get the term text).

          Uwe Schindler added a comment -

          We should at least test EmptyTokenizer with IndexWriter.

          By the way, EmptyTokenizer is the wrong name, as it needs no input reader. It should just extend TokenStream. That's the real bug here!

          Robert Muir added a comment -

          There is an EmptyTokenStream in src/java that is correct.

          EmptyTokenizer is in test-framework, for some crazy test reason. If it's not being used, let's remove it!

          Adrien Grand added a comment -

          Alternative patch that fixes BaseTokenStreamTestCase. I needed to add a quick hack that adds a TermToBytesRefAttribute when the token stream doesn't have one, so that TermsHashPerField doesn't complain that it can't find this attribute when indexing.
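
          A hedged sketch of what such a workaround could look like (hypothetical, not the actual patch): adding a CharTermAttribute also registers a TermToBytesRefAttribute, because CharTermAttributeImpl implements both interfaces.

            // Hypothetical workaround sketch, not the committed code:
            if (!ts.hasAttribute(TermToBytesRefAttribute.class)) {
              // CharTermAttributeImpl implements TermToBytesRefAttribute, so this
              // also registers the attribute that TermsHashPerField.start() looks up.
              ts.addAttribute(CharTermAttribute.class);
            }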

          Uwe Schindler added a comment -

          I needed to add a quick hack to add a TermToBytesRefAttribute when the tokenstream doesn't have one so that TermsHashPerField doesn't complain that it can't find this attribute when indexing.

          But this is a bug. If you use the non-test EmptyTokenStream from the analysis-common package and add it to a document, it will fail, right? So we should fix IndexWriter to handle that case?

          Adrien Grand added a comment -

          So we should fix IndexWriter to handle that case?

          How would IndexWriter handle token streams with no TermToBytesRefAttribute?

          • fail if the token stream happens to have tokens (i.e., incrementToken returns true at least once)?
          • index empty terms?

          Robert Muir added a comment -

          I'm not really certain what it should do.

          It does seem a little funky, though, that we start() the consumer even though incrementToken returned false.

          Uwe Schindler added a comment -

          The problem here is that the attributes are initialized after construction of the Tokenizer, before the consumer starts consuming tokens. The bug in IndexWriter is that it fails when that initial getAttribute call fails. Maybe it should just initialize the bytesRef attribute to null and fail later, only if tokens are actually emitted.

          Lucene 3.x indexed empty terms.
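
          Sketched, the lazy-lookup idea looks something like this (hypothetical consumer code, not the actual DocInverterPerField patch):

            // Defer the attribute lookup until a token is actually produced,
            // so an empty stream never triggers the IllegalArgumentException.
            TermToBytesRefAttribute termAtt = null; // stays null for empty streams
            stream.reset();
            while (stream.incrementToken()) {
              if (termAtt == null) {
                // only now may a missing attribute legitimately cause a failure
                termAtt = stream.getAttribute(TermToBytesRefAttribute.class);
              }
              // ... hand the term bytes from termAtt to the terms hash ...
            }
            stream.end();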

          Uwe Schindler added a comment - edited

          Here is a patch showing the bug when the public class EmptyTokenStream from analysis-common works together with IndexWriter:

          [junit4:junit4] ERROR   0.33s | TestEmptyTokenStream.testIndexWriter_LUCENE4656 <<<
          [junit4:junit4]    > Throwable #1: java.lang.IllegalArgumentException: This AttributeSource does not have the attribute 'org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute'.
          [junit4:junit4]    >    at __randomizedtesting.SeedInfo.seed([3B209861053849AF:D7B239E3D4067832]:0)
          [junit4:junit4]    >    at org.apache.lucene.util.AttributeSource.getAttribute(AttributeSource.java:303)
          [junit4:junit4]    >    at org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:119)
          [junit4:junit4]    >    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:109)
          [junit4:junit4]    >    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:272)
          [junit4:junit4]    >    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
          [junit4:junit4]    >    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:376)
          [junit4:junit4]    >    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1456)
          [junit4:junit4]    >    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1131)
          [junit4:junit4]    >    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1112)
          [junit4:junit4]    >    at org.apache.lucene.analysis.miscellaneous.TestEmptyTokenStream.testIndexWriter_LUCENE4656(TestEmptyTokenSt

          It also has a test that assertTokenStreamContents actually works, which it doesn't at the moment, because it asserts that the CTA is available. But NumericTokenStream and this one both don't have this attribute.

          Uwe Schindler added a comment -

          Here is the fix that solves the DocInverterPerField issue (it also removes the horrible for(;;) loop whose first clause is an "if ... break").

          Now only BaseTokenStreamTestCase still needs to handle the missing attribute. It should only complain when tokens are actually emitted.

          Uwe Schindler added a comment -

          Better patch: uses do...while, which is more readable.
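
          The loop shape in question, sketched abstractly (illustrative only, not the committed DocInverterPerField code): the first incrementToken() call is hoisted out, so the consumer is never started for an empty stream and the loop body no longer begins with an "if ... break".

            if (stream.incrementToken()) {
              consumer.start(field); // safe: at least one token exists
              do {
                // ... process the current token ...
              } while (stream.incrementToken());
            }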

          Robert Muir added a comment -

          Here's a patch for BaseTokenStreamTestCase. I think it should work for this EmptyTokenizer too.

          Uwe Schindler added a comment -

          New patch merged with Adrien's. I am not sure the fix in BaseTokenStreamTestCase is correct, because if you pass the String[] you expect tokens, and the fix is different from the one for offsets or position increments.

          Uwe Schindler added a comment -

          Thanks Robert! I will merge again, and I took over the issue!

          Uwe Schindler added a comment -

          Patch merged with Robert's.

          Uwe Schindler added a comment -

          Added a check that the document is really present in the IndexWriter after indexing.

          Robert Muir added a comment -

          Slightly related to the BaseTokenStreamTestCase changes, I think it's confusing how we use output.length (from the String[]) also as the number of expected tokens.

          We could clear this up with something like:

          @@ -114,21 +114,32 @@
             public static void assertTokenStreamContents(...
               assertNotNull(output);
          +    final int numExpected = output.length;
          

          and then use this in the for loop and such.

          Additionally, I've often passed the wrong number of parameters when writing tests, because you are passing huge parallel arrays.
          So something like this could save some trouble:

               TypeAttribute typeAtt = null;
               if (types != null) {
                 assertTrue("has no TypeAttribute", ts.hasAttribute(TypeAttribute.class));
                 typeAtt = ts.getAttribute(TypeAttribute.class);
          +      assertEquals("wrong number of types", numExpected, types.length);
               }
          

          We don't have to make these changes here; it just reminded me of it while looking at this stuff.

          Robert Muir added a comment -

          I would also say that we don't need EmptyTokenizer in test-framework.
          It's only there because two places use it, and both in a bogus way (in my opinion):
          1. core/TestDocument
          2. queryparsers

          We should first fix TestDocument; its test does not care whether the token stream is empty or anything:

          Index: src/test/org/apache/lucene/document/TestDocument.java
          ===================================================================
          --- src/test/org/apache/lucene/document/TestDocument.java	(revision 1428441)
          +++ src/test/org/apache/lucene/document/TestDocument.java	(working copy)
          @@ -20,7 +20,7 @@
           import java.io.StringReader;
           import java.util.List;
           
          -import org.apache.lucene.analysis.EmptyTokenizer;
          +import org.apache.lucene.analysis.CannedTokenStream;
           import org.apache.lucene.analysis.MockAnalyzer;
           import org.apache.lucene.index.DirectoryReader;
           import org.apache.lucene.index.IndexReader;
          @@ -318,7 +318,7 @@
             // LUCENE-3616
             public void testInvalidFields() {
               try {
          -      new Field("foo", new EmptyTokenizer(new StringReader("")), StringField.TYPE_STORED);
          +      new Field("foo", new CannedTokenStream(), StringField.TYPE_STORED);
                 fail("did not hit expected exc");
               } catch (IllegalArgumentException iae) {
                 // expected
          

          The queryparser test looks outdated, like it's some test about when an Analyzer returns null?
          Maybe the test can just be removed, but if we apply this patch, we could move EmptyTokenizer
          from test-framework/src/java to queryparser/src/test at least as an improvement, since it is kinda funky.

          Uwe Schindler added a comment -

          I would really like to remove that horrible piece of sh*

          Adrien Grand added a comment -

          Uwe, I just ran all Lucene tests with your patch and they passed, so +1. +1 to removing EmptyTokenizer too.

          Uwe Schindler added a comment -

          In trunk it is not even used in core! Only in 4.x's TestDocument!

          Uwe Schindler added a comment -

          Patch with the Tokenizer completely removed!

          In trunk it's completely unused in core; in modules/queryparser it is just used to supply a Tokenizer that returns no tokens at all (not null, just an empty token stream). MockTokenizer is fine for this, too.

          Uwe Schindler added a comment -

          Sorry, that was the wrong patch.

          Uwe Schindler added a comment -

          Correct patch. TestDocument actually used it, too - sorry.
          This patch also makes the QueryParser Analyzer correctly reusable. The trick is to simply override initReader() in the Analyzer to return an empty reader and close the original one.
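
          A minimal sketch of that initReader() trick, assuming the 4.x Analyzer API (names and details are illustrative, not the committed test code; assumes the usual java.io and org.apache.lucene.analysis imports):

            // Analyzer that ignores its input: initReader() closes the original
            // reader and substitutes an empty one, so no tokens are ever produced
            // and the same instance can be reused safely.
            Analyzer alwaysEmpty = new Analyzer() {
              @Override
              protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                return new TokenStreamComponents(new MockTokenizer(reader, MockTokenizer.WHITESPACE, false));
              }
              @Override
              protected Reader initReader(String fieldName, Reader reader) {
                try {
                  reader.close(); // discard the caller's input
                } catch (IOException e) {
                  throw new RuntimeException(e);
                }
                return new StringReader(""); // always tokenize the empty string
              }
            };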

          Uwe Schindler added a comment -

          I also cleaned up some imports in affected files. I will commit this later and backport.

          Uwe Schindler added a comment -

          I committed the latest patch. Thanks Adrien and Robert!

          Commit Tag Bot added a comment -

          [branch_4x commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1428675

          Merged revision(s) 1428671 from lucene/dev/trunk:
          LUCENE-4656: Fix regression in IndexWriter to work with empty TokenStreams that have no TermToBytesRefAttribute (commonly provided by CharTermAttribute), e.g., oal.analysis.miscellaneous.EmptyTokenStream. Remove EmptyTokenizer from test-framework.

          Commit Tag Bot added a comment -

          [trunk commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1428671

          LUCENE-4656: Fix regression in IndexWriter to work with empty TokenStreams that have no TermToBytesRefAttribute (commonly provided by CharTermAttribute), e.g., oal.analysis.miscellaneous.EmptyTokenStream. Remove EmptyTokenizer from test-framework.


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Adrien Grand
            • Votes:
              0
              Watchers:
              2
