Lucene - Core
LUCENE-1282

Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3, 2.3.1
    • Fix Version/s: 2.4
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is not a Lucene bug. It's an as-yet not fully characterized Sun
      JRE bug, as best I can tell. I'm opening this to gather all things we
      know, and to work around it in Lucene if possible, and maybe open an
      issue with Sun if we can reduce it to a compact test case.

      It's hit at least 3 users:

      http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/%3c8c4e68610803180438x39737565q9f97b4802ed774a5@mail.gmail.com%3e
      http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/%3c4807654E.7050900@virginia.edu%3e
      http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/%3c733777220805060156t7fdb8fectf0bc984fbfe48a22@mail.gmail.com%3e

      At least JRE 1.6.0_04 and 1.6.0_05 are affected, whereas 1.6.0_03
      works OK; it's unknown whether 1.6.0_06 shows it.

      The bug affects bulk merging of stored fields. When it strikes, the
      segment produced by a merge is corrupt because its fdx file (stored
      fields index file) is missing one document. After iterating many
      times with the first user that hit this, adding diagnostics &
      assertions, it seems that a call to fieldsWriter.addDocument sometimes
      either fails to run entirely, or fails to invoke its call to
      indexStream.writeLong. It's as if when hotspot compiles a method,
      there's some sort of race condition in cutting over to the compiled
      code whereby a single method call fails to be invoked (speculation).

      Unfortunately, this corruption is silent when it occurs and only later
      detected when a merge tries to merge the bad segment, or an
      IndexReader tries to open it. Here's a typical merge exception:

      Exception in thread "Thread-10" 
      org.apache.lucene.index.MergePolicy$MergeException: 
      org.apache.lucene.index.CorruptIndexException:
          doc counts differ for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
              at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
      Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
              at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
              at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
              at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221)
              at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099)
              at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
              at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
      

      and here's a typical exception hit when opening a searcher:

      org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _kk: fieldsReader shows 72670 but segmentInfo shows 72671
              at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
              at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
              at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230)
              at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73)
              at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
              at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
              at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
              at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
              at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)
      

      Sometimes, adding -Xbatch (forces up front compilation) or -Xint
      (disables compilation) to the java command line works around the
      issue.

      Here are some of the OS's we've seen the failure on:

      SuSE 10.0
      Linux phoebe 2.6.13-15-smp #1 SMP Tue Sep 13 14:56:15 UTC 2005 x86_64 
      x86_64 x86_64 GNU/Linux 
      
      SuSE 8.2
      Linux phobos 2.4.20-64GB-SMP #1 SMP Mon Mar 17 17:56:03 UTC 2003 i686 
      unknown unknown GNU/Linux 
      
      Red Hat Enterprise Linux Server release 5.1 (Tikanga)
      Linux lab8.betech.virginia.edu 2.6.18-53.1.14.el5 #1 SMP Tue Feb 19 
      07:18:21 EST 2008 i686 i686 i386 GNU/Linux
      

      I've already added assertions to Lucene to detect when this bug
      strikes, but since assertions are not usually enabled, I plan to add a
      real check to catch when this bug strikes before we commit the merge
      to the index. This way we can detect & quarantine the failure and
      prevent corruption from entering the index.
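      In Lucene 2.3's format the fdx file is essentially one 8-byte pointer
      per stored-fields document, so a check in that spirit can compare the
      file's length against the merged doc count. A minimal sketch (the
      names FdxCheck/verifyFdxLength are illustrative, not Lucene's actual
      code):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FdxCheck {

    // Sketch: the stored-fields index (.fdx) holds one 8-byte file
    // pointer per document, so after merging docCount documents its
    // length must be exactly 8 * docCount.
    static void verifyFdxLength(long fdxLength, int docCount) {
        long expected = 8L * docCount;
        if (fdxLength != expected) {
            throw new RuntimeException(
                "doc counts differ: fdx shows " + (fdxLength / 8)
                + " but merge wrote " + docCount);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a merge that was supposed to write 3 documents...
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream fdx = new DataOutputStream(bytes);
        fdx.writeLong(0L);    // doc 0 pointer
        fdx.writeLong(412L);  // doc 1 pointer
        // ...but doc 2's writeLong silently never ran (the bug).
        try {
            verifyFdxLength(bytes.size(), 3);
            System.out.println("BUG NOT DETECTED");
        } catch (RuntimeException e) {
            System.out.println("detected: " + e.getMessage());
        }
    }
}
```

      Running such a check before committing the merge lets the writer
      abort the merge instead of publishing the corrupt segment.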

      1. hs_err_pid27359.log
        13 kB
        Michael McCandless
      2. crashtest.log
        2 kB
        andreaskohn
      3. crashtest
        1 kB
        andreaskohn
      4. corrupt_merge_out15.txt
        49 kB
        Stu Hood

        Activity

        Paul Smith added a comment -

        Another workaround might be to use '-client' instead of the default '-server' (for server class machines). This affects a few things, not least this switch:

        -XX:CompileThreshold=10000 Number of method invocations/branches before compiling [-client: 1,500]

        -server implies a 10000 value. I have personally observed similar behaviour to problems like the above with -server, and usually -client ends up 'solving' them.

        I'm sure there was also a way to mark a method to not JIT-compile (rather than resorting to -Xint, which disables it for everything), but now I can't find what that syntax is at all.

        Bruce Ritchie added a comment -

        From Mark Miller on the developer's mailing list:

        Here's a couple examples of that exclude method syntax (had to use it recently with eclipse):

        -XX:CompileCommand=exclude,org/apache/lucene/index/IndexReader\$1,doBody
        -XX:CompileCommand=exclude,org/eclipse/core/internal/dtree/DataTreeNode,forwardDeltaWith

        Stu Hood added a comment -

        I've also been struck by this bug, with Lucene 2.3.2. I'd been running for a while with JRE 1.6.0_05 when I noticed it, so I downgraded to JRE 1.6.0_02 to try and work around it, but no luck.

        Could a bugged index created with JRE 1.6.0_05 be causing addIndexesNoOptimize to trigger this bug, even with JRE 1.6.0_02?

        Thanks.

        Michael McCandless added a comment -

        Could a bugged index created with JRE 1.6.0_05 be causing addIndexesNoOptimize to trigger this bug, even with JRE 1.6.0_02

        Unfortunately, yes. Once the corruption enters the index, then no matter which JRE you are using, you will hit that exception.

        In your case, I can see that indeed segment _2y9, which is pre-existing when you call addIndexesNoOptimize, is the corrupt segment.

        In general, you can use CheckIndex to see if you have any latent corruption.

        I'm afraid you either have to run CheckIndex -fix to remove that segment (and possibly others that are also corrupt) from your index, or, create a new index.

        This bug is very frustrating!

        Can you describe how you built up this index? EG was this bulk created (open a single writer, add all the docs, close it), or, created with many separate instances of IndexWriter over time? Were documents added via add/updateDocument or via addIndexes*? Do you run the JRE with any "interesting" command-line options? I'd really like to narrow down the "typical" cases when this bug strikes if we can...

        Nick Menere added a comment -

        We had a lot of customers report this bug, though it was about 2 years ago that Sun fixed this, so we only see it in very rare cases now, and even then our fix is... "Please upgrade your JVM".

        I am flat out at the moment, but if you need, I will try and get some more info on this if you want.

        Nick Menere added a comment -

        Actually, I'm not convinced it is the same bug: we kept getting complete JVM crashes. I just assumed it was the same (I wouldn't be surprised if it was related though).

        Michael McCandless added a comment -

        Another datapoint from Ian Lea:

        My job (http://lucene.markmail.org/message/awkkunr7j24nh4qj) still
        fails with java version 1.6.0_06 (build 1.6.0_06-b02), downloaded
        today, with both lucene 2.3.1 and 2.3.2.

        For me, downgrading to 1.6.0_03-b05 fixed things.

        Michael McCandless added a comment -

        I finally managed to reproduce this JVM bug, except that my case
        happens while merging term vectors (mergeVectors), not stored
        fields as all the other cases seem to be.

        I'm running JRE 1.6.0_05 on a Debian Linux box.

        In my case, which just uses a modified contrib/benchmark to add 2000
        wikipedia docs to a large index, I got to the point (when index was 19
        GB) where every single time I run the benchmark alg, it hits an
        exception. Often the exception looks like this:

        Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
        	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
        	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
        Caused by: java.io.IOException: read past EOF
        	at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
        	at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
        	at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:68)
        	at org.apache.lucene.store.IndexInput.readLong(IndexInput.java:91)
        	at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:345)
        	at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:992)
        	at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:441)
        	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:138)
        	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
        	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
        	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
        	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
        

        I then added the same check that we now have for mergeFields, ie, to
        verify that the size of the index file (_X.tvx) matches the number of
        documents merged. Sometimes, however, I see this different exception:

        Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: 9375
        	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
        	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
        Caused by: java.lang.ArrayIndexOutOfBoundsException: 9375
        	at org.apache.lucene.store.BufferedIndexOutput.writeByte(BufferedIndexOutput.java:36)
        	at org.apache.lucene.store.IndexOutput.writeVInt(IndexOutput.java:71)
        	at org.apache.lucene.index.TermVectorsWriter.addAllDocVectors(TermVectorsWriter.java:76)
        	at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:443)
        	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:138)
        	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
        	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
        	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
        	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
        

        where the particular array index would vary all over the place. This
        is VERY odd because that array is the buffer in BufferedIndexOutput
        and is always allocated to 16384 bytes so 9375 (and all others I saw)
        is not out of bounds.

        JRE 1.5.0_08 always runs fine. Likewise running JRE 1.6.0_05 with
        -Xint also runs fine. However, JRE 1.6.0_05 with -Xbatch still hits
        exceptions.

        So then I started testing "trivial" modifications to the Java source
        code in the mergeVectors, and found, insanely, that this simple diff
        completely stopped the exceptions:

                     } else {
        -              termVectorsWriter.addAllDocVectors(reader.getTermFreqVectors(docNum));
        +              // NOTE: it's very important to first assign
        +              // to vectors then pass it to
        +              // termVectorsWriter.addAllDocVectors; see
        +              // LUCENE-1282
        +              TermFreqVector[] vectors = reader.getTermFreqVectors(docNum);
        +              termVectorsWriter.addAllDocVectors(vectors);
        

        (Ie, just forcing an assignment to a local variable).

        It's crazy that such a trivial mod actually makes a difference in this
        JRE bug (I would have expected it to be optimized away fairly early on
        in compilation), but, I'm quite sure that diff resolves at least the
        exceptions I've been seeing. So I plan to commit this JRE bug
        workaround to 2.4 & 2.3 branch.

        I still haven't been able to hit the JRE bug when merging stored
        fields, but, I'm still making that same corresponding mod to
        mergeFields.
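
        Stripped of Lucene's types, the workaround amounts to nothing more
        than hoisting the nested call's result into a local variable. A
        stand-in sketch (Reader/Writer here are illustrative interfaces,
        not Lucene classes):

```java
public class LocalVarWorkaround {

    interface Reader { String[] getVectors(int docNum); }
    interface Writer { void addAll(String[] vectors); }

    // Before: the nested call shape that tripped the hotspot bug.
    static void mergeNested(Reader r, Writer w, int docNum) {
        w.addAll(r.getVectors(docNum));
    }

    // After: semantically identical, but the intermediate result is
    // first assigned to a local variable (see LUCENE-1282).
    static void mergeWithLocal(Reader r, Writer w, int docNum) {
        String[] vectors = r.getVectors(docNum);
        w.addAll(vectors);
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        Reader r = docNum -> new String[] { "v" + docNum };
        Writer w = vectors -> out.append(vectors[0]);
        mergeNested(r, w, 1);
        mergeWithLocal(r, w, 2);
        System.out.println(out); // v1v2
    }
}
```

        Both methods produce identical bytecode-visible behaviour; only the
        second shape avoided the miscompilation on the affected JREs.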

        Yonik Seeley added a comment -

        See, that complex code even confuses the JVM
        Awesome job coming up with this workaround! (crosses fingers for stored fields)

        Ismael Juma added a comment -

        Hi,

        Great work on tracking this down, it looks like a very nasty bug. Has it been reported to Sun yet? It seems like the kind of bug that could manifest itself in other places too, so important to get a real fix.

        Paul Smith added a comment -

        Throwing up an idea here for consideration. I'm sure it could be shot down, but I thought I'd raise it just in case it hasn't already been considered and discarded.

        One of the classic problems between -client and -server mode is the way CPU registers are used. Is it possible that some of the fields are suffering from concurrency issues? I was wondering if, say, BufferedIndexOutput.buffer* may need to be marked volatile?

        One easy way to test if this makes a difference is to just try switching between explicit use of '-client' and '-server'. Most newer machines (even desktops & laptops) appear to qualify for Sun's 'am I a server-class machine' check. If these problems disappear by switching to -client, this to me would smell more and more like volatile-like behaviour, because AIUI -server will be more aggressive with some of its register optimizations, and I've seen behaviour just like this where variables that have clearly been written have changes that are not 'appearing' on the other side. Even the same thread making the change can be switched across to a different CPU right in the middle, and could see different results.

        Of course those people with lots of concurrency experience can probably dismiss this theory in a second, but that's fine.

        Michael McCandless added a comment -

        Has it been reported to Sun yet? It seems like the kind of bug that could manifest itself in other places too, so important to get a real fix.

        Not yet, but I intend to. I'm trying to whittle it down. I agree the bug is nasty and could strike again at any time. The AIOOB exceptions I was hitting were truly bizarre.

        Michael McCandless added a comment -

        One of the classic problems between -client and -server mode is the way CPU registers are used. Is it possible that some of the fields are suffering from concurrency issues? I was wondering if, say, BufferedIndexOutput.buffer* may need to be marked volatile?

        In my 100% reproducible case of this JRE bug, I'm using only 1 thread, so I don't think a volatile should be necessary here.

        But I like your idea to try -client vs -server – I will test that & post back. The more data we can gather the better... I did find it interesting that -Xbatch did NOT resolve it, but has for at least one of the above users.

        I'm wondering if it has something to do with writing to large (> 32 bit) files. In my test case, the index keeps kicking off a large merge (produces 2.7 GB segment) and it's that merge that trips the bug.

        Michael McCandless added a comment -

        In my 100% reproducible case of this JRE bug, I'm using only 1 thread, so I don't think a volatile should be necessary here.

        Woops: I am, however, using the default ConcurrentMergeScheduler, so this very-large merge runs in its own thread. Still, it's only that one thread that's accessing this code/state, so by the spec volatile should not be necessary.

        Michael McCandless added a comment -

        OK: running with -client prevents the bug.

        Running with SerialMergeScheduler still shows the bug.

        I'm going to try to make a standalone test that just runs this one merge....

        Michael McCandless added a comment -

        Using the 19 GB index I have that consistently reproduces this hotspot bug, I boiled the bug down to a very small testcase that no longer involves Lucene.

        However, this occurrence of the bug is slightly different: for me, by specifying -Xbatch on the java command line, the bug consistently happens. It only rarely happens without -Xbatch. Nonetheless, I'm hopeful that if Sun fixes this one test case properly, it will fix all the odd exceptions we've been seeing from this code.

        I opened the bug 4 days ago (5/15) with http://bugs.sun.com, but have yet to hear if it's been accepted as a real bug.

        If others could try out the code below on their Linux boxes, using 1.6.0_04/05 of Sun's java, specifying -Xbatch, to see if the bug can be reproduced, that'd be great.

        Here's the bug I opened:

        Date Created: Thu May 15 11:53:15 MDT 2008
        Type:        bug
        Customer Name:   Michael McCandless
        Customer Email:  mail@mikemccandless.com
        SDN ID:       mail@mikemccandless.com
        status:      Waiting
        Category:    hotspot
        Subcategory: runtime_system
        Company:     IBM
        release:     6
        hardware:    x86
        OSversion:   linux
        priority:    4
        Synopsis:    Simple code runs incorrectly with -Xbatch
        Description:
         FULL PRODUCT VERSION :
        java version "1.6.0_06"
        Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
        Java HotSpot(TM) Server VM (build 10.0-b22, mixed mode)
        
        
        
        FULL OS VERSION :
        Linux 2.6.22.1 #7 SMP PREEMPT Tue Mar 18 18:22:09 EDT 2008 i686 GNU/Linux
        
        A DESCRIPTION OF THE PROBLEM :
        On the Apache Lucene project, we've now had 4 users hit by an apparent
        JRE bug.  When this bug strikes, it silently corrupts the search
        index, which is very costly to the user (makes the index unusable).
        Details are here:
        
          https://issues.apache.org/jira/browse/LUCENE-1282
        
        I can reliably reproduce the bug, but only on a very large (19 GB)
        search index.  But I narrowed one variant of the bug down to the attached
        test case.
        
        
        
        THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: No
        
        THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes
        
        STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
        Compile and run the attached code (Crash.java) with -Xbatch and it should fail (i.e., incorrectly throw the
        RuntimeException).  It should pass without -Xbatch.
        
        
        
        
        
        EXPECTED VERSUS ACTUAL BEHAVIOR :
        Expected: no RuntimeException is thrown.  Actual: it is thrown.
        REPRODUCIBILITY :
        This bug can be reproduced always.
        
        ---------- BEGIN SOURCE ----------
        public class Crash {
        
          public static void main(String[] args) throws Throwable {
            new Crash().crash();
          }
        
          private Object alwaysNull;
        
          final void crash() throws Throwable {
            for (int r = 0; r < 3; r++) {
              for (int docNum = 0; docNum < 10000;) {
                if (r < 2) {
                  for(int j=0;j<3000;j++)
                    docNum++;
                } else {
                  docNum++;
                  doNothing(getNothing());
                  if (alwaysNull != null) {
                    throw new RuntimeException("BUG: checkAbort is always null: r=" + r + " of 3; docNum=" + docNum);
                  }
                }
              }
            }
          }
        
          Object getNothing() {
            return this;
          }
        
          int x;
          void doNothing(Object o) {
            x++;
          }
        }
        
        
        ---------- END SOURCE ----------
        
        CUSTOMER SUBMITTED WORKAROUND :
        Don't specify -Xbatch.  You can also tweak the code to make it pass the test.  Reducing the 10000
        or 3000 low enough makes it pass.  Changing the doNothing(...) line
        to assign the result of getNothing() to an intermediate variable
        first also makes it pass (this is the approach we plan to use for Lucene).  Removing the x++ also makes it pass.
        workaround:  
        comments:    (company - IBM , email - mail@mikemccandless.com)
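The intermediate-variable workaround described in the bug report above can be sketched against the same test case. This is a hypothetical illustration (the class name `CrashWorkaround` is mine, not from the issue); on the affected HotSpot builds the rewritten call site was reported to avoid the miscompilation, and on fixed VMs the program simply prints PASS:

```java
// Hypothetical sketch of the customer-submitted workaround: assign the
// result of getNothing() to a local variable before passing it to
// doNothing(), rather than calling doNothing(getNothing()) directly.
public class CrashWorkaround {

  public static void main(String[] args) {
    new CrashWorkaround().run();
  }

  private Object alwaysNull;
  int x;

  final void run() {
    for (int r = 0; r < 3; r++) {
      for (int docNum = 0; docNum < 10000;) {
        if (r < 2) {
          for (int j = 0; j < 3000; j++)
            docNum++;
        } else {
          docNum++;
          // The workaround: route the call through an intermediate
          // variable instead of doNothing(getNothing()).
          Object o = getNothing();
          doNothing(o);
          if (alwaysNull != null) {
            throw new RuntimeException("BUG: alwaysNull is not null: r=" + r
                + " docNum=" + docNum);
          }
        }
      }
    }
    System.out.println("PASS");
  }

  Object getNothing() {
    return this;
  }

  void doNothing(Object o) {
    x++;
  }
}
```

On a JRE with the HotSpot fix, this prints PASS; the point of the workaround was that it also passed on the broken 1.6.0_04/05 builds where the original shape of the call site failed.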
        
        Ismael Juma added a comment -

        Great work finding a reduced test-case. I tried the sample application with JDK6u4 32-bit and I can add the following:

        • It always happens with -Xbatch -server when the code is compiled with javac.
        • It happens sometimes (sometimes after 3 attempts, sometimes after 10, etc.) with -server when the code is compiled with javac.
        • I am unable to reproduce it when compiling with the Eclipse compiler (running from the command line to avoid any other differences).
        Ismael Juma added a comment -

        Btw, if I increase the number of iterations to 1000000 for docNum and 300000 for j I can reproduce it every time without -Xbatch.

        Ismael Juma added a comment -

        It's worth noting that jdk 6u10 beta b24 (released today) and openjdk6 in Fedora 9 are also affected by the problem shown in the test-case.

        Michael McCandless added a comment -

        This bug is spooky. I tried another workaround, which is to just increment an unused variable, instead of the above diff but at the same spot. That then causes the JRE to reliably crash (SEGV). I'm attaching the hs_err log.

        Sun has not yet "accepted" my bug. If/when they do, I'll attach this error log to it.
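The failed alternative workaround Michael describes (incrementing an unused variable at the same spot, instead of the intermediate-variable change) can be sketched like this. The class name `CrashSegv` and field name `unused` are hypothetical; on the affected builds this variant reportedly crashed the VM outright with a SEGV, while on fixed VMs it prints PASS:

```java
// Hypothetical sketch of the variant that triggered a hard VM crash
// (SEGV) on the affected HotSpot builds, per the comment above.
public class CrashSegv {

  public static void main(String[] args) {
    new CrashSegv().run();
  }

  private Object alwaysNull;
  int x;
  int unused;  // incremented but never read

  final void run() {
    for (int r = 0; r < 3; r++) {
      for (int docNum = 0; docNum < 10000;) {
        if (r < 2) {
          for (int j = 0; j < 3000; j++)
            docNum++;
        } else {
          docNum++;
          unused++;  // the attempted workaround that instead caused a SEGV
          doNothing(getNothing());
          if (alwaysNull != null) {
            throw new RuntimeException("BUG: alwaysNull is not null: r=" + r
                + " docNum=" + docNum);
          }
        }
      }
    }
    System.out.println("PASS");
  }

  Object getNothing() {
    return this;
  }

  void doNothing(Object o) {
    x++;
  }
}
```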

        Michael McCandless added a comment -

        OK I've committed the workaround & bug detection to trunk (2.4) and 2.3 branch.

        At this point I think that's all we can do here; we are now waiting on Sun to fix the JRE bug.

        andreaskohn added a comment -

        We've seen this bug (rarely) when indexing quite large amounts of data.

        Just to add some datapoints, attached is crashtest, using the above Crash.java to test all Java VMs I currently have available.
        crashtest.log contains the output.

        Tests were run on a loaded EM64T dual-core machine with Fedora 9/x86_64; all VMs are 64-bit. The OpenJDK is a build from yesterday's public repository contents, built using gcc 4.3 (trivial patches to make it build were added).

        Some scary Solaris (SunOS 5.10 Generic_120011-14 sun4u sparc SUNW,UltraAX-i2) results as well:

        /usr/jdk/jdk1.6.0_04 (java full version "1.6.0_04-b12"):
        : 0/200 failed: PASS
        -server: 0/200 failed: PASS
        -client: 0/200 failed: PASS
        -Xbatch: 0/200 failed: PASS
        -Xint: 0/200 failed: PASS

        /usr/jdk/jdk1.6.0_04 (java full version "1.6.0_04-b12"):
        -d64: 0/200 failed: PASS
        -server -d64: 0/200 failed: PASS
        -client -d64: 0/200 failed: PASS
        -Xbatch -d64: 0/200 failed: PASS
        -Xint -d64: 0/200 failed: PASS

        Michael McCandless added a comment - edited

        Here is the bug at Sun: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6707044
        Jed Wesley-Smith added a comment -

        Sun has posted their evaluation on the bug above and accepted it as High priority.

        Paul Smith added a comment -

        Can anyone comment as to whether the JRE 1.6.0_04+ bug affects any earlier versions of Lucene (say, 2.0, which we're still using)?

        I was just reviewing this issue and noticed Michael mentioned this behaviour shows in both the ConcurrentMergeScheduler and the SerialMergeScheduler. AIUI, the SerialMergeScheduler is effectively the 'old' way of previous versions of Lucene, so I'm starting to wonder what effect 1.6.0_04 might have on earlier versions (this bug is only marked as affecting 2.3+).

        The reason I ask is that we're just about to upgrade to 1.6.0_04 -server on some of our production machines (the reason we're not going to 1.6.0_06 is that we started our development test cycle months ago and stuck with _04 until the next cycle).

        Michael McCandless added a comment -

        Can anyone comment as to whether the JRE 1.6.04+ bug affects any earlier versions of Lucene? (say, 2.0.. which we're still using).

        As far as I know, this corruption only happens on Lucene 2.3+. The changes to Lucene that tickled this JRE bug were bulk-merging of stored fields:

        https://issues.apache.org/jira/browse/LUCENE-1043

        which landed in 2.3, and also bulk-merging of term vectors:

        https://issues.apache.org/jira/browse/LUCENE-1120

        Ismael Juma added a comment -

        As can be seen in the Sun bug database, a fix for this has been committed to OpenJDK, and they're looking into backporting it to Java 6 Update 10.

        Ismael Juma added a comment -

        The latest build of JDK 6 Update 10 (b28) includes the fix for this. It can be downloaded from:

        http://download.java.net/jdk6/binaries/

        In the summary of changes, you can see that it refers to a bug that requests the integration of a new HotSpot build that includes the fix for this:

        http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6727119

        I have also verified that the test-case now passes on my machine.

        Michael McCandless added a comment -

        Indeed, I can confirm that JDK 6 Update 10 (b28) fixes my 19 GB test case that reliably crashes with earlier JDK 6 versions.

        I'll resolve this as fixed, and send an email to users.


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 5
          • Watchers: 8