Lucene - Core
  LUCENE-2501

ArrayIndexOutOfBoundsException in ByteBlockPool.allocSlice

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.1
    • Fix Version/s: 3.6, 4.0-BETA, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I'm seeing the following exception during indexing:

      Caused by: java.lang.ArrayIndexOutOfBoundsException: 14
      at org.apache.lucene.index.ByteBlockPool.allocSlice(ByteBlockPool.java:118)
      at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:490)
      at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:511)
      at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:104)
      at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:120)
      at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:468)
      at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
      at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:246)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:774)
      at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:757)
      at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2085)
      ... 37 more
      

      This seems to be caused by the following code:

          final int level = slice[upto] & 15;
          final int newLevel = nextLevelArray[level];
          final int newSize = levelSizeArray[newLevel];
      

      This can result in "level" being any value from 0 to 15, but the array nextLevelArray only has 10 entries.

      I suspect the solution would be either to cap the level at 9, or to add more entries to nextLevelArray so it has 16 entries.
      However, I don't know if something more is going wrong here and this is just where the exception surfaces from a deeper issue.
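
      A minimal sketch of the slice-level scheme (the array values below are copied from Lucene's ByteBlockPool; the class and helper names are illustrative, not the real API):

```java
public class SliceLevels {
    // Each allocated slice ends in a marker byte whose low 4 bits record
    // the slice's level; allocSlice reads it back via slice[upto] & 15.
    static final int[] nextLevelArray = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
    static final int[] levelSizeArray = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

    // Follow the level chain across a number of slice allocations. In a
    // healthy pool every marker byte was written by a previous allocSlice
    // call, so the level can never leave the range [0, 9].
    static int follow(int level, int steps) {
        for (int i = 0; i < steps; i++) {
            level = nextLevelArray[level];
        }
        return level;
    }

    public static void main(String[] args) {
        // Levels saturate at 9 no matter how many slices are chained:
        System.out.println(follow(0, 100));
        // A corrupted marker such as 14 (as in the reported exception)
        // would index past the end of the 10-element nextLevelArray.
    }
}
```

      This is why a marker byte of 14 points at corruption upstream rather than a simple sizing bug: allocSlice never writes a level above 9.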

      Attachments:
      1. LUCENE-2501.patch (1 kB, Michael McCandless)

        Activity

        Michael McCandless added a comment -

        Hmmm, not good. Can you boil this down to a smallish test case?

        level should never be > 9, because nextLevelArray[*] is no greater than 9. Something more serious is up...

        Tim Smith added a comment -

        That's what I was afraid of.

        I got this report second-hand, so I don't have access to the data that was being ingested,
        and I currently don't know enough about this section of the indexing code to guess my way to a unit test.
        I'll try to create a test, but I expect it will be difficult (especially if no one else has ever seen this).

        Michael McCandless added a comment -

        Is this issue repeatable, on a different machine?

        We do have a randomized test for this (TestByteSlices) – I'll go start it with a big random.multiplier. Maybe it can uncover this.

        Michael McCandless added a comment -

        What size RAM buffer was being used for IW when this exception happened?

        Tim Smith added a comment -

        Here's all the info I have available right now (I'll try to get more):

        16-core, 18 GB RAM Windows 7 machine
        1 JVM
        16 index writers (each using default settings: 64 MB RAM, etc.)
        300+ docs/sec ingestion (small documents)
        commit every 10 minutes
        optimize every hour

        The report I got indicated that every now and then one of these ArrayIndexOutOfBounds exceptions would occur.
        This would result in the document being indexed failing, but otherwise things would continue normally.

        Tim Smith added a comment -

        Some more info:

        Ingestion is being performed in multiple threads.

        The ArrayIndexOutOfBounds exceptions are occurring in bursts.
        I suspect that these bursts of exceptions stop after the next commit (at which point the buffers are all reset).
        NOTE: I have not yet confirmed this, but I suspect it.

        Tim Smith added a comment -

        Looks like this may be the original source of the errors:

        Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (607 <= 607 )
        	at org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:76)
        	at org.apache.lucene.index.FreqProxTermsWriter.appendPostings(FreqProxTermsWriter.java:209)
        	at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:127)
        	at org.apache.lucene.index.TermsHash.flush(TermsHash.java:144)
        	at org.apache.lucene.index.DocInverter.flush(DocInverter.java:72)
        	at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:64)
        	at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:583)
        	at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3602)
        	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3511)
        	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3502)
        	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2103)
        
        Michael McCandless added a comment -

        Are you certain about IW's RAM buffer size? If the RAM buffer size was close to 2GB it could lead to exceptions like this.

        Tim Smith added a comment -

        The RAM buffer size is set to 64.0 MB.

        Michael McCandless added a comment -

        Can you capture IW.setInfoStream output leading up to it?

        Tim Smith added a comment -

        Will do.

        It may take some time before it occurs again.

        Also, if this boils down to a synchronization error of some sort, the extra file I/O done to write the trace info to disk may add some implicit synchronization/slowdown that results in not being able to reproduce the issue. (I've seen this happen with non-Lucene synchronization issues: add the extra debug logging and it never fails anymore.)

        Michael McCandless added a comment -

        also, if this boils down to a synchronization error of some sort, the extra file io done to write the trace info to disk may add some implicit synchronization/slowdown that may result in not being able to reproduce the issue

        Ahh yes, the Heisenbug (http://en.wikipedia.org/wiki/Unusual_software_bug#Heisenbug). Still it's worth a shot to see if we can catch it in action...

        Tim Smith added a comment -

        I've been informed that this exception is still happening.

        However, whenever index tracing is turned on, it never seems to occur (the extra logging seems to be preventing some lower-level synchronization issue from surfacing).

        Michael McCandless added a comment -

        Could you still post the infoStream output? I can at least look at this to see how IW is being used.

        Does this occur on different machines? (Hardware issues could lead to exceptions like this).

        Gili Nachum added a comment -

        Seeing a similar issue on 3.1.0. Was this ever resolved? Or is there a workaround?

        Stack:

        00000049 SeedlistOpera Failed to process operation ADD java.lang.ArrayIndexOutOfBoundsException
        	at org.apache.lucene.index.ByteBlockPool.allocSlice(ByteBlockPool.java:135)
        	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:502)
        	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:523)
        	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:106)
        	at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:126)
        	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:479)
        	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:169)
        	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:248)
        	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:701)
        	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:2194)
        	at
        ...

        Michael McCandless added a comment -

        OK, I found a possible cause behind this ... it was something I had
        fixed in LUCENE-3684 but didn't pull out and backport to 3.x.

        It's a thread-safety issue, when FieldInfo.indexOptions changes from
        DOCS_AND_FREQS_AND_POSITIONS to not indexing positions. If this
        happens in one thread while a new thread is suddenly indexing that
        same field, there's a narrow window where the 2nd thread's
        FreqProxTermsWriterPerField can mis-report the streamCount as 1 when
        it should be 2.

        The attached patch (3.6.x) should fix it. I tried to get a thread test to
        provoke this but couldn't ... I think the window is too small (if I
        forcefully add sleeps at the "right time" in
        FreqProxTermsWriterPerField then I could provoke it...).
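
        Conceptually, the race is a check-then-act on the field's index options. A small, illustrative sketch (not Lucene's actual code; only the enum constant names are taken from Lucene) of how the stream count is derived:

```java
public class StreamCountSketch {
    // Mirrors Lucene's per-field index options; the class itself is a
    // simplified stand-in, not the real implementation.
    enum IndexOptions { DOCS_ONLY, DOCS_AND_FREQS, DOCS_AND_FREQS_AND_POSITIONS }

    // Two byte streams are needed when positions are indexed (docs/freqs
    // in one stream, positions in the other); otherwise just one. If a
    // second thread downgrades the shared field's options between this
    // read and the writes that depend on it, position bytes get written
    // into a one-stream layout, which can corrupt slice marker bytes and
    // later surface as the ArrayIndexOutOfBoundsException above.
    static int streamCount(IndexOptions options) {
        return options == IndexOptions.DOCS_AND_FREQS_AND_POSITIONS ? 2 : 1;
    }
}
```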

        Simon Willnauer added a comment -

        Sneaky; glad that this stuff is single-threaded in 4.0.

        Michael McCandless added a comment -

        I'm glad too!

        Robert Muir added a comment -

        +1

        Michael McCandless added a comment -

        I committed the patch, but I'll leave this open until we can hear back from Tim or Gili that this has resolved the issue ...

        Gili Nachum added a comment -

        Issue resolved successfully. Even after increasing the degree of concurrency, I can no longer reproduce it with 16 threads on a 4-core machine.
        Thank you, Michael!

        Michael McCandless added a comment -

        Thanks for bringing closure, Gili.


  People

  • Assignee: Unassigned
  • Reporter: Tim Smith
  • Votes: 0
  • Watchers: 5