
LUCENE-1995: ArrayIndexOutOfBoundsException during indexing

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9.1
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Activity

      Yonik Seeley added a comment -

      The code at the exception point uses a signed shift instead of an unsigned one, but that shouldn't matter unless the buffer pool is huge?
      Aaron, what are your index settings (like ramBufferSizeMB)?

      Michael McCandless added a comment -

      Spooky! It does look likely that we overflowed an int, because (1 + Integer.MAX_VALUE) >> 15 is -65536.
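      For illustration (this snippet is not from the Lucene source, just a stand-alone demo of that arithmetic): once an int byte count wraps past Integer.MAX_VALUE, the signed shift yields a negative block index that fails fast with an ArrayIndexOutOfBoundsException, whereas an unsigned shift would silently yield a positive but wrong index.

      // Stand-alone demo of the int overflow above (not Lucene code).
      public class ShiftOverflowDemo {
        public static void main(String[] args) {
          int overflowed = 1 + Integer.MAX_VALUE;   // wraps to Integer.MIN_VALUE (-2147483648)
          System.out.println(overflowed >> 15);     // signed shift: -65536, a negative index that fails fast
          System.out.println(overflowed >>> 15);    // unsigned shift: 65536, positive but wrong
        }
      }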

      Aaron McKee added a comment -

      I make no claims to the reasonableness of these settings; I only recently began efforts to tune our prototype. =)

      useCompoundFile: false
      mergeFactor: 10
      maxBufferedDocs: 5000000
      ramBufferSizeMB: 8192
      maxFieldLength: 10000
      reopenReaders: true

      My system has 24 GB and my index is typically ~16 GB, so I set some of these values a bit high. If the RAM buffer is being indexed with an int, that could certainly be my issue; I feel a bit silly for not having thought of that already. I'll try setting it down to 2048 and see if the problem disappears.
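      For reference, a rough sketch of how settings like these map onto the Lucene 2.9 IndexWriter API; the directory path and analyzer here are illustrative, not Aaron's actual setup.

      // Illustrative only: roughly the settings above, applied via the 2.9 API.
      import java.io.File;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.store.FSDirectory;
      import org.apache.lucene.util.Version;

      public class TunedWriter {
        public static void main(String[] args) throws Exception {
          IndexWriter writer = new IndexWriter(
              FSDirectory.open(new File("/path/to/index")),    // hypothetical index path
              new StandardAnalyzer(Version.LUCENE_29),
              IndexWriter.MaxFieldLength.LIMITED);             // caps maxFieldLength at 10,000
          writer.setUseCompoundFile(false);
          writer.setMergeFactor(10);
          writer.setMaxBufferedDocs(5000000);
          writer.setRAMBufferSizeMB(1024);   // keep well under the 2 GB limit discussed below
          writer.setMaxFieldLength(10000);
          writer.close();
        }
      }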

      Yonik Seeley added a comment -

      lol - well, there we go. Looks like perhaps a JavaDoc fix (and a comment in solrconfig.xml)? The buffer size was never meant to be quite so large.

      Mike - I think keeping the signed shift is the right thing to do... a zero-cost check against silent corruption.
      But I'm not sure if 2048MiB is safe either... I'm not sure if one could overflow the number of buffers somehow as well (is every buffer except the last fully utilized?)

      Michael McCandless added a comment -

      That's a nice large RAM buffer

      Mike - I think keeping the signed shift is the right thing to do... a zero-cost check against silent corruption.

      Ahh good point, OK we'll keep it as is.

      But I'm not sure if 2048MiB is safe either

      2048 probably won't be safe, because a large doc just as the buffer is filling up could still overflow. (Though, RAM is also used eg for norms, so you might squeak by).

      I'll update the javadocs to note the limitation!

      Michael McCandless added a comment -

      Thanks Aaron! Maybe someday Lucene will allow a larger RAM buffer than 2GB...

      Fuad Efendi added a comment -

      I recall a bug in Arrays.sort() (Joshua Bloch) which was fixed after 9 years; "signed" instead of "unsigned"...

      Mark Miller added a comment -

      If you're talking about the merge sort/binary search fix, that was Martin Buchholz. Joshua Bloch just helped spread the word.

      Speaking of which, there is another one of these in the flex branch: SimpleStandardTermsIndexReader, binary search.

      Fuad Efendi added a comment -

      Joshua writes in his Google Research Blog:
      "The version of binary search that I wrote for the JDK contained the same bug. It was reported to Sun recently when it broke someone's program, after lying in wait for nine years or so."
      http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html
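      For reference, the heart of that JDK bug and its fix, sketched here independently of the JDK or Lucene source:

      // Classic binary-search midpoint bug from the linked post (illustrative).
      int mid = (low + high) / 2;         // overflows to a negative value once low + high > Integer.MAX_VALUE
      // Two standard fixes:
      int midA = (low + high) >>> 1;      // unsigned shift reinterprets the wrapped sum, restoring the midpoint
      int midB = low + (high - low) / 2;  // avoids the overflowing sum entirely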

      Anyway, this is the reporter's specific use case; I didn't have ANY problems with ramBufferSizeMB: 8192 during a month (at least) of constant updates (5000/sec)... Yes, I am using term vectors (as Michael noticed, it plays a role)...

      And what exactly causes the problem is unclear; having an explicit check for 2048 is just a workaround... a quick shortcut...

      Mark Miller added a comment -

      Joshua writes in his Google Research Blog:
      "The version of binary search that I wrote for the JDK contained the same bug. It was reported to Sun recently when it broke someone's program, after lying in wait for nine years or so."
      http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html

      Right - that's Joshua spreading the word. The guy who found the bug also provided the fix. Hence, to him goes the credit for both finding the bug and fixing it. Simple as that.

      Fuad Efendi added a comment -

      But who did the bug? Joshua writes it's him - based on others' famous findings and books...
      ===
      "it just contains a few lines of code that calculates a double value from two document fields and then stores that value in one of these dynamic fields"
      And the problem happens when he indexes document number 15,000,000...

      I am guessing he is indexing "double"... (type=tdouble, indexed=t, stored=f)... Why do we ever need to index a multi-valued "double" field? Cardinality is the highest possible... I don't know Lucene internals; I am thinking that (double, docID) will occupy 12 bytes, and with a multivalued (or dynamic) field we may need a lot of RAM for 15 million docs... especially if we are trying to put objects into buckets using a hash of "double"...
      Yonik Seeley added a comment -

      Anyway, this is the reporter's specific use case; I didn't have ANY problems with ramBufferSizeMB: 8192

      We've been over this on solr-user. If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048.

      And what exactly causes the problem is unclear; having an explicit check for 2048 is just a workaround... a quick shortcut...

      No, we only support a max of 2GB ram buffer, by design currently. So the explicit check is so you get the error immediately instead of far into an indexing process.
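      A minimal sketch of what such an up-front check might look like (the signature mirrors the public setRAMBufferSizeMB setter, but the body and message are illustrative, not the committed fix). The 2048 cutoff isn't arbitrary: 2048 MB is exactly 2^31 bytes, one more than Integer.MAX_VALUE, so anything larger can't be addressed with an int.

      // Illustrative guard, not the committed code: fail immediately on an unsupported
      // buffer size instead of overflowing deep inside an indexing run.
      public void setRAMBufferSizeMB(double mb) {
        if (mb > 2048.0) {
          throw new IllegalArgumentException(
              "ramBufferSize " + mb + " is too large; must be <= 2048 MB");
        }
        // ... existing flush-trigger logic ...
      }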

      Mark Miller added a comment -

      But who did the bug? Joshua writes, it's him

      Wow, you like being obstinate. He put the code in Sun's JVM, but he didn't come up with the algorithm. He took it, and the bug with it, from elsewhere. He didn't "do" the bug either. He just propagated it.

      Fuad Efendi added a comment -

      He took it, and the bug with it, from elsewhere. He didn't "do" the bug either. He just propagated it.

      This is even worse, especially for such a classic case as Arrays.sort(). Buggy propagation...

      "The sorting algorithm is a tuned quicksort, adapted from Jon L. Bentley and M. Douglas McIlroy's "Engineering a Sort Function", Software-Practice and Experience, Vol. 23(11) P. 1249-1265 (November 1993). This algorithm offers n*log(n) performance on many data sets that cause other quicksorts to degrade to quadratic performance."

      If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048.

      Now I think my actual usage was below 2 GB...

      No, we only support a max of 2GB ram buffer, by design currently.

      Thanks for the confirmation... However, the JavaDoc didn't mention that explicitly, and "by design" is unclear wording... it's already been "by design" for several years...

      2048 probably won't be safe, because a large doc just as the buffer is filling up could still overflow. (Though, RAM is also used eg for norms, so you might squeak by).

      Uncertainty...
      Fuad Efendi added a comment - edited

      If your usage actually went above 2GB, you would have had a problem. 8192 is not a valid value, we don't support it, and now we'll throw an exception if it's over 2048.

      Now I think my actual usage was below 2 GB...

      How was I below 2048 if I had a few segments created by IndexWriter during a day, without any "SOLR-commit"?.. Maybe I am wrong, it was a few weeks ago... I am currently using 1024 because I need memory for other stuff too, and I don't want to try again...

      Michael McCandless added a comment -

      Bulk close all 2.9.1 issues.


        People

        • Assignee: Michael McCandless
        • Reporter: Yonik Seeley
        • Votes: 0
        • Watchers: 1
