Lucene - Core
LUCENE-4635

ArrayIndexOutOfBoundsException when a segment has many, many terms

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6.2
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Spinoff from Tom Burton-West's java-user thread "CheckIndex ArrayIndexOutOfBounds error for merged index" ( http://markmail.org/message/fatijkotwucn7hvu ).

      I modified Test2BTerms to instead generate a little over 10B terms, ran it (took 17 hours and created a 162 GB index) and hit a similar exception:

      Time: 62,164.058
      There was 1 failure:
      1) test2BTerms(org.apache.lucene.index.Test2BTerms)
      java.lang.ArrayIndexOutOfBoundsException: 1246
      	at org.apache.lucene.index.TermInfosReaderIndex.compareField(TermInfosReaderIndex.java:249)
      	at org.apache.lucene.index.TermInfosReaderIndex.compareTo(TermInfosReaderIndex.java:225)
      	at org.apache.lucene.index.TermInfosReaderIndex.getIndexOffset(TermInfosReaderIndex.java:156)
      	at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
      	at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
      	at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:539)
      	at org.apache.lucene.search.TermQuery$TermWeight$1.add(TermQuery.java:56)
      	at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:81)
      	at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:87)
      	at org.apache.lucene.util.ReaderUtil$Gather.run(ReaderUtil.java:70)
      	at org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:53)
      	at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:199)
      	at org.apache.lucene.search.Searcher.createNormalizedWeight(Searcher.java:168)
      	at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:664)
      	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:342)
      	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:330)
      	at org.apache.lucene.index.Test2BTerms.testSavedTerms(Test2BTerms.java:205)
      	at org.apache.lucene.index.Test2BTerms.test2BTerms(Test2BTerms.java:154)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      

      The index actually built and optimized successfully; the AIOOBE was hit only when we ran searches for the random terms we had collected along the way.

      I suspect this is a bug somewhere in the compact in-RAM terms index ... I'll dig.
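The failure mode can be sketched outside Lucene. The following is a minimal, hypothetical illustration (not the actual TermInfosReaderIndex or PagedBytes code): a global byte position past 2^31, stored in an int, wraps negative, and deriving a page index from the wrapped value produces a negative array index, i.e. an ArrayIndexOutOfBoundsException. The page size and helper names here are assumptions for illustration.

```java
// Hypothetical sketch (NOT Lucene's actual code) of how int offset
// arithmetic past 2 GB turns into an AIOOBE: the byte position wraps
// negative when narrowed to int, and the page index computed from it
// is negative too.
public class PagedOffsetBug {
    static final int PAGE_BITS = 15;                 // 32 KB pages (assumed size)
    static final int PAGE_MASK = (1 << PAGE_BITS) - 1;

    // Correct addressing keeps the global position in a long and only
    // narrows to int for values guaranteed to fit (page index, in-page offset).
    static int pageOf(long pos)   { return (int) (pos >>> PAGE_BITS); }
    static int offsetOf(long pos) { return (int) (pos & PAGE_MASK); }

    public static void main(String[] args) {
        long bigPos = 3_000_000_000L; // a term's byte offset past 2^31
        int truncated = (int) bigPos; // what a 32-bit code path would store

        System.out.println(truncated);              // -1294967296: wrapped negative
        // Deriving a page index from the wrapped int stays negative, so
        // indexing the page array with it throws AIOOBE.
        System.out.println(truncated >> PAGE_BITS); // negative page index

        System.out.println(pageOf(bigPos));         // 91552: correct page
        System.out.println(offsetOf(bigPos));       // 24064: correct in-page offset
    }
}
```

Tracking the position as a long and narrowing only where the value provably fits avoids the wrap entirely.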

      1. LUCENE-4635.patch
        4 kB
        Michael McCandless
      2. LUCENE-4635.patch
        0.5 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        I suspect this fixes the issue ... at least CheckIndex on my 162 GB index is getting beyond where it failed previously.

        I'll make a separate Test2BPagedBytes test!

        Michael McCandless added a comment -

        New patch, with a test, and fixing another place where we could overflow an int.

        I think it's ready.
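The usual shape of this kind of fix (sketched here as a hypothetical example, not the contents of the attached patch) is to force the multiply into 64-bit arithmetic before it can wrap, rather than casting an already-wrapped 32-bit result:

```java
// Hypothetical illustration of the int-overflow fix pattern
// (not the actual LUCENE-4635 patch): widen to long *before*
// the arithmetic, not after.
public class WidenFirst {
    static final int BLOCK_SIZE = 1 << 15; // 32 KB blocks, an assumed size

    // Buggy: the multiply happens in 32-bit int arithmetic and wraps;
    // the cast to long merely preserves the already-wrong value.
    static long offsetBuggy(int blockIndex) {
        return (long) (blockIndex * BLOCK_SIZE);
    }

    // Fixed: casting one operand first forces a 64-bit multiply.
    static long offsetFixed(int blockIndex) {
        return (long) blockIndex * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        int block = 70000; // 70000 * 32768 > Integer.MAX_VALUE
        System.out.println(offsetBuggy(block)); // -2001207296: wrapped
        System.out.println(offsetFixed(block)); // 2293760000: correct
    }
}
```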

        Robert Muir added a comment -

        In general we should do a review and better testing of this PagedBytes code.

        Stuff like what's going on in copy() really scares me.

        But for now I think you should commit. Even if all of PagedBytes isn't totally safe, we should at least fix the terms-index problems it causes in 3.6.2.

        I also think we should go for a 3.6.2 release once this is fixed. We already have a good number of bugfixes sitting in the branch.

        Michael McCandless added a comment -

        OK, it turns out this same issue was already fixed in LUCENE-4568 for 4.x/5.x ... we just never backported it to 3.6.x.

        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1423718

        LUCENE-4635: add test

        Michael McCandless added a comment -

        4.x/5.x were already fixed ...

        Thanks Tom!

        Commit Tag Bot added a comment -

        [trunk commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1423720

        LUCENE-4635: add test
        Michael McCandless added a comment -

        I ran Test10BTerms (just Test2BTerms with the number of terms multiplied by 5 and the token length increased by 1) on 4.x and it passed!

        It was faster (14 hours vs. I think 17 hours for 3.6.x), and the index was smaller (129 GB vs. 162 GB).

        Michael McCandless added a comment -

        Whoops! Thanks Steve.

        Mike McCandless

        http://blog.mikemccandless.com

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 2
