Lucene - Core
LUCENE-1072

NullPointerException during indexing in DocumentsWriter$ThreadState$FieldData.addPosition

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None
    • Environment:

      Linux CentOS 5 x86_64 running on 2-core Pentium D, Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_01-b06, mixed mode), using lucene-core-2007-11-29_02-49-31

    • Lucene Fields:
      New

      Description

      In my case, documents with unusually large "words" (in fact text-encoded images) sometimes appear during indexing.
      An attempt to add a document containing a field with such a token produces a java.lang.IllegalArgumentException:
      java.lang.IllegalArgumentException: term length 37944 exceeds max term length 16383
      at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.addPosition(DocumentsWriter.java:1492)
      at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1321)
      at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1247)
      at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:972)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2202)
      at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2186)
      at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1432)
      at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)

      This is expected; the exception is caught and ignored. The problem is that afterwards the IndexWriter is left in a corrupted state, and subsequent attempts to add documents to the index fail as well, this time with an NPE:
      java.lang.NullPointerException
      at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.addPosition(DocumentsWriter.java:1497)
      at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1321)
      at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1247)
      at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:972)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2202)
      at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2186)
      at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1432)
      at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)

      This is 100% reproducible.
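
      For context, a minimal sketch of the failing pattern (hypothetical
      application code against a Lucene 2.3-era snapshot; the synthetic
      oversized token stands in for the text-encoded images described above):

        import org.apache.lucene.analysis.KeywordAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.RAMDirectory;

        public class OversizedTermRepro {
          public static void main(String[] args) throws Exception {
            // KeywordAnalyzer emits the whole field value as one token,
            // so it can produce a term over the 16383-char limit.
            IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new KeywordAnalyzer(), true);

            // Build a single "word" longer than the max term length.
            StringBuffer huge = new StringBuffer();
            for (int i = 0; i < 20000; i++) huge.append('x');

            Document bad = new Document();
            bad.add(new Field("content", huge.toString(),
                Field.Store.NO, Field.Index.TOKENIZED));
            try {
              writer.addDocument(bad); // throws IllegalArgumentException
            } catch (IllegalArgumentException e) {
              // caught and ignored, as described above
            }

            // Before the fix, this second add fails with the NPE.
            Document good = new Document();
            good.add(new Field("content", "aa bb cc",
                Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(good);
            writer.close();
          }
        }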

      Attachments

      1. LUCENE-1072.patch
        10 kB
        Michael McCandless
      2. LUCENE-1072.take2.patch
        11 kB
        Michael McCandless

        Activity

        Grant Ingersoll added a comment -

        I think this is related: https://issues.apache.org/jira/browse/LUCENE-1052
        See http://www.gossamer-threads.com/lists/lucene/java-dev/54371

        Michael McCandless added a comment -

        Attached patch. I plan to commit in a day or two.

        I added a unit test showing that indeed DocumentsWriter becomes
        unusable once it's hit a "too long term", then fixed the issue so the
        unit test passes.

        Now, if we encounter too-long terms in the doc, we skip those terms
        but continue indexing the other, acceptable terms from the doc, then
        throw the IllegalArgumentException at the end, after processing the
        full document. So it's now "ok" to catch & ignore this exception,
        though clearly in general you should address its root cause so you
        don't accidentally pollute your term dictionary (see LUCENE-1052, as
        Grant suggested, once that happens!).
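
        One way to address the root cause up front (sketched here; this
        wrapper analyzer is hypothetical and not part of the patch) is to
        strip over-long tokens in the analysis chain, e.g. with the stock
        LengthFilter:

          import java.io.Reader;
          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.LengthFilter;
          import org.apache.lucene.analysis.TokenStream;
          import org.apache.lucene.analysis.standard.StandardAnalyzer;

          // Hypothetical wrapper: drops tokens longer than the 16383-char
          // limit enforced by DocumentsWriter, so the writer never sees an
          // over-long term and no IllegalArgumentException is thrown.
          public class LengthLimitedAnalyzer extends Analyzer {
            private final Analyzer delegate = new StandardAnalyzer();

            public TokenStream tokenStream(String fieldName, Reader reader) {
              return new LengthFilter(delegate.tokenStream(fieldName, reader),
                  1, 16383);
            }
          }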

        Michael McCandless added a comment -

        I just committed this. Thanks for reporting it Alexei!

        Michael Busch added a comment -

        I'm seeing a similar issue when TokenStream.next() throws an
        IOException (or a RuntimeException). The DocumentsWriter is
        thereafter no longer usable, i.e. subsequent calls to
        addDocument() fail with a NullPointerException.

        I added this test to TestIndexWriter which shows the problem:

          public void testExceptionFromTokenStream() throws IOException {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new Analyzer() {
        
              public TokenStream tokenStream(String fieldName, Reader reader) {
                return new TokenFilter(new StandardTokenizer(reader)) {
                  private int count = 0;
        
                  public Token next() throws IOException {
                    if (count++ == 5) {
                      throw new IOException();
                    }
                    return input.next();
                  }
                };
              }
        
            }, true);
        
            Document doc = new Document();
            String contents = "aa bb cc dd ee ff gg hh ii jj kk";
            doc.add(new Field("content", contents, Field.Store.NO,
                Field.Index.TOKENIZED));
            try {
              writer.addDocument(doc);
              fail("did not hit expected exception");
            } catch (Exception e) {
              // expected: the IOException from the TokenFilter surfaces here
            }
        
            // Make sure we can add another normal document
            doc = new Document();
            doc.add(new Field("content", "aa bb cc dd", Field.Store.NO,
                Field.Index.TOKENIZED));
            writer.addDocument(doc);
        
            // Make sure we can add another normal document
            doc = new Document();
            doc.add(new Field("content", "aa bb cc dd", Field.Store.NO,
                Field.Index.TOKENIZED));
            writer.addDocument(doc);
        
            writer.close();
          }
        
        
        Michael McCandless added a comment -

        OK, I added that as a test case (to TestIndexWriter), and then fixed
        it. Attached patch. I plan to commit in 1 or 2 days. Thanks
        Michael!

        This was happening during DW.abort(), which was being called on an
        unhandled exception to clear all documents added since the last flush.
        It was incorrectly recycling a null Posting instance.

        I've also tightened when abort() is called to only those places that
        actually require it. A failure in the tokenization of one document
        should not discard previously indexed, not-yet-flushed documents. I
        added asserts to the test case to verify that.
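
        In outline, the failure mode was roughly this (a hypothetical,
        self-contained sketch, not the committed code; Posting, the
        postings array, and recycle() stand in for the real
        DocumentsWriter internals):

          public class AbortSketch {
            static class Posting { }

            static int recycled = 0;

            static void recycle(Posting p) {
              recycled++; // the real code pushes p onto a free list for reuse
            }

            // abort() walks the buffered Posting slots to recycle them.
            // When the exception hit mid-document, the last slot could
            // still be null; recycling it blindly put a null on the free
            // list, which surfaced as an NPE on the next addDocument().
            // The null check below is the missing guard, in spirit.
            static void abortRecycle(Posting[] postings, int count) {
              for (int i = 0; i < count; i++) {
                Posting p = postings[i];
                if (p == null)
                  continue;
                recycle(p);
              }
            }

            public static void main(String[] args) {
              Posting[] slots = { new Posting(), new Posting(), null };
              abortRecycle(slots, slots.length);
              System.out.println("recycled " + recycled + ", skipped nulls");
            }
          }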

        Michael Busch added a comment -

        Thanks for the quick fix, Mike. All unit tests, including the new
        one, pass.

        I also added this patch to the Lucene version in our app, and it
        works fine now. Even after the TokenStream throws a RuntimeException,
        the DocumentsWriter is still usable for subsequent docs.

        +1 for committing this soon!!

        Michael McCandless added a comment -

        OK, thanks for testing it! I will commit shortly...

        Michael McCandless added a comment -

        OK I just committed this!


          People

          • Assignee: Michael McCandless
          • Reporter: Alexei Dets
          • Votes: 0
          • Watchers: 0
