Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7440

Document skipping on large indexes is broken

    Details

    • Lucene Fields:
      New

      Description

      Large skips on large indexes fail.
      Anything that uses skips (such as a boolean query, filtered queries, faceted queries, join queries, etc) can trigger this bug on a sufficiently large index.

      The bug is a numeric overflow in MultiLevelSkipList that has been present since inception (Lucene 2.2). It may not manifest until one has a single segment with more than ~1.8B documents, and a large skip is performed on that segment.

      Typical stack trace on Lucene7-dev:

      java.lang.ArrayIndexOutOfBoundsException: 110
      	at org.apache.lucene.codecs.MultiLevelSkipListReader$SkipBuffer.readByte(MultiLevelSkipListReader.java:297)
      	at org.apache.lucene.store.DataInput.readVInt(DataInput.java:125)
      	at org.apache.lucene.codecs.lucene50.Lucene50SkipReader.readSkipData(Lucene50SkipReader.java:180)
      	at org.apache.lucene.codecs.MultiLevelSkipListReader.loadNextSkip(MultiLevelSkipListReader.java:163)
      	at org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:133)
      	at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$BlockDocsEnum.advance(Lucene50PostingsReader.java:421)
      	at YCS_skip7$1.testSkip(YCS_skip7.java:307)
      

      Typical stack trace on Lucene4.10.3:

      6-08-31 18:57:17,460 ERROR org.apache.solr.servlet.SolrDispatchFilter: null:java.lang.ArrayIndexOutOfBoundsException: 75
       at org.apache.lucene.codecs.MultiLevelSkipListReader$SkipBuffer.readByte(MultiLevelSkipListReader.java:301)
       at org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
       at org.apache.lucene.codecs.lucene41.Lucene41SkipReader.readSkipData(Lucene41SkipReader.java:194)
       at org.apache.lucene.codecs.MultiLevelSkipListReader.loadNextSkip(MultiLevelSkipListReader.java:168)
       at org.apache.lucene.codecs.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:138)
       at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.advance(Lucene41PostingsReader.java:506)
       at org.apache.lucene.search.TermScorer.advance(TermScorer.java:85)
      [...]
       at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
      [...]
       at org.apache.solr.core.SolrCore.execute(SolrCore.java:2004)
      
      1. LUCENE-7440.patch
        6 kB
        Yonik Seeley
      2. LUCENE-7440.patch
        7 kB
        Yonik Seeley

        Issue Links

          Activity

          Hide
          dweiss Dawid Weiss added a comment -

          Yonik, what exactly is broken? Do you have a stack trace or repro code?

          Show
          dweiss Dawid Weiss added a comment - Yonik, what exactly is broken? Do you have a stack trace or repro code?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Patch coming... was waiting for tests to pass.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Patch coming... was waiting for tests to pass.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Here's a patch, and a test that fails w/o the patch (takes 30min on my old linux box).

          There are two obvious ways to fix:

          • use long[] instead of int[] for numSkipped
          • check numSkipped for overflow and compensate where used

          I used the latter in this patch since it avoids increasing the memory footprint and there was only a single place to check for overflow. Subsequent uses of the overflowed value were already automatically corrected by subtracting the same value that caused the overflow in the first place.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Here's a patch, and a test that fails w/o the patch (takes 30min on my old linux box). There are two obvious ways to fix: use long[] instead of int[] for numSkipped check numSkipped for overflow and compensate where used I used the latter in this patch since it avoids increasing the memory footprint and there was only a single place to check for overflow. Subsequent uses of the overflowed value were already automatically corrected by subtracting the same value that caused the overflow in the first place.
          Hide
          thetaphi Uwe Schindler added a comment -

          Nice findings. A really old bug!

          Show
          thetaphi Uwe Schindler added a comment - Nice findings. A really old bug!
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Regarding the 1.8B docs number... at least in my tests I saw the top-level skip distance of ~268M w/ the default codec. Subtracting this from MAX_INT gives around 1.8B, which is around the number I saw prior to the overflow. To hit the bug, one also needs to be doing large skips toward the end of the index as well, in order to use the top level(s) of the multi-level skip list. Having a conjunction query of a highly unique term (or clause) in conjunction with a common term has a good chance of triggering (example: +timestamp:39520928456494 +doctype:common)

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Regarding the 1.8B docs number... at least in my tests I saw the top-level skip distance of ~268M w/ the default codec. Subtracting this from MAX_INT gives around 1.8B, which is around the number I saw prior to the overflow. To hit the bug, one also needs to be doing large skips toward the end of the index as well, in order to use the top level(s) of the multi-level skip list. Having a conjunction query of a highly unique term (or clause) in conjunction with a common term has a good chance of triggering (example: +timestamp:39520928456494 +doctype:common)
          Hide
          thetaphi Uwe Schindler added a comment -

          Fix with negative check looks fine to me, this should be cheap. I'd suggest that Michael McCandless also looks into it, he knows this code better.

          Show
          thetaphi Uwe Schindler added a comment - Fix with negative check looks fine to me, this should be cheap. I'd suggest that Michael McCandless also looks into it, he knows this code better.
          Hide
          mikemccand Michael McCandless added a comment -

          What a nice find!

          I would rather we just use long[] here? The added memory cost is minor, and I think understandable code is more important.

          Show
          mikemccand Michael McCandless added a comment - What a nice find! I would rather we just use long[] here? The added memory cost is minor, and I think understandable code is more important.
          Hide
          rcmuir Robert Muir added a comment -

          Another alternative would be to use Integer.compareUnsigned()

          Show
          rcmuir Robert Muir added a comment - Another alternative would be to use Integer.compareUnsigned()
          Hide
          mikemccand Michael McCandless added a comment -

          Another alternative would be to use Integer.compareUnsigned()

          +1

          These values should never overflow 2^32

          Show
          mikemccand Michael McCandless added a comment - Another alternative would be to use Integer.compareUnsigned() +1 These values should never overflow 2^32
          Hide
          thetaphi Uwe Schindler added a comment -

          +1 compareUnsigned is the way to go! It's an intrinsic, too.

          Show
          thetaphi Uwe Schindler added a comment - +1 compareUnsigned is the way to go! It's an intrinsic, too.
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Ok, I did a quick review to ensure that the other operand in the comparison, docCount, would never legitimately be negative and then switched to compareUnsigned.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Ok, I did a quick review to ensure that the other operand in the comparison, docCount, would never legitimately be negative and then switched to compareUnsigned.
          Hide
          thetaphi Uwe Schindler added a comment -

          +1

          Show
          thetaphi Uwe Schindler added a comment - +1
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit c929d0595c0ad2ef311054746dc24aa8704f55e6 in lucene-solr's branch refs/heads/master from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c929d05 ]

          LUCENE-7440: fix MultiLevelSkipListReader overflow

          Show
          jira-bot ASF subversion and git services added a comment - Commit c929d0595c0ad2ef311054746dc24aa8704f55e6 in lucene-solr's branch refs/heads/master from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c929d05 ] LUCENE-7440 : fix MultiLevelSkipListReader overflow
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit cf72eebf75bb3c5c3bdd8ef2ee288e67f89b5538 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf72eeb ]

          LUCENE-7440: fix MultiLevelSkipListReader overflow

          Show
          jira-bot ASF subversion and git services added a comment - Commit cf72eebf75bb3c5c3bdd8ef2ee288e67f89b5538 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cf72eeb ] LUCENE-7440 : fix MultiLevelSkipListReader overflow
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Re-opened to back-port to 6.2.1

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Re-opened to back-port to 6.2.1
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit c7b3e9ae3695a13dacb81312db0d470ada273808 in lucene-solr's branch refs/heads/branch_6_2 from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c7b3e9a ]

          LUCENE-7440: fix MultiLevelSkipListReader overflow

          (cherry picked from commit cf72eeb)

          Show
          jira-bot ASF subversion and git services added a comment - Commit c7b3e9ae3695a13dacb81312db0d470ada273808 in lucene-solr's branch refs/heads/branch_6_2 from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c7b3e9a ] LUCENE-7440 : fix MultiLevelSkipListReader overflow (cherry picked from commit cf72eeb)
          Hide
          mdrob Mike Drob added a comment -

          Regarding the 1.8B docs number... at least in my tests I saw the top-level skip distance of ~268M w/ the default codec. Subtracting this from MAX_INT gives around 1.8B, which is around the number I saw prior to the overflow. To hit the bug, one also needs to be doing large skips toward the end of the index as well, in order to use the top level(s) of the multi-level skip list. Having a conjunction query of a highly unique term (or clause) in conjunction with a common term has a good chance of triggering (example: +timestamp:39520928456494 +doctype:common)

          Would this be faster to test if we configure a larger top-level skip distance? i.e. set up a skip distance of ~1B and then we'd only need to get to ~1.1B docs indexed (40% fewer docs, theoretically 40% faster?) or even set up a skip distance of ~2B to only need to index very few documents?

          Maybe this idea should be split into a separate issue to focus on improving the test?

          Show
          mdrob Mike Drob added a comment - Regarding the 1.8B docs number... at least in my tests I saw the top-level skip distance of ~268M w/ the default codec. Subtracting this from MAX_INT gives around 1.8B, which is around the number I saw prior to the overflow. To hit the bug, one also needs to be doing large skips toward the end of the index as well, in order to use the top level(s) of the multi-level skip list. Having a conjunction query of a highly unique term (or clause) in conjunction with a common term has a good chance of triggering (example: +timestamp:39520928456494 +doctype:common) Would this be faster to test if we configure a larger top-level skip distance? i.e. set up a skip distance of ~1B and then we'd only need to get to ~1.1B docs indexed (40% fewer docs, theoretically 40% faster?) or even set up a skip distance of ~2B to only need to index very few documents? Maybe this idea should be split into a separate issue to focus on improving the test?
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Would this be faster to test if we configure a larger top-level skip distance?

          The top-level skip distance sort of falls out from other factors, rather than being explicitly configured.
          For a quicker, more thorough test, It would probably be good to somehow test the skip list logic itself w/o having it backed by an actual index. Even with that, I think it's a good idea to also test real indexes.

          Show
          yseeley@gmail.com Yonik Seeley added a comment - Would this be faster to test if we configure a larger top-level skip distance? The top-level skip distance sort of falls out from other factors, rather than being explicitly configured. For a quicker, more thorough test, It would probably be good to somehow test the skip list logic itself w/o having it backed by an actual index. Even with that, I think it's a good idea to also test real indexes.
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.2.1 release

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.2.1 release
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1c219d2da692d8b41c02d8a52ffacd813cd44a7b in lucene-solr's branch refs/heads/branch_5_5 from Yonik Seeley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1c219d2 ]

          LUCENE-7440: fix MultiLevelSkipListReader overflow

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1c219d2da692d8b41c02d8a52ffacd813cd44a7b in lucene-solr's branch refs/heads/branch_5_5 from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1c219d2 ] LUCENE-7440 : fix MultiLevelSkipListReader overflow
          Hide
          jpountz Adrien Grand added a comment -

          Reopen for backport to 5.5.4.

          Show
          jpountz Adrien Grand added a comment - Reopen for backport to 5.5.4.
          Hide
          jpountz Adrien Grand added a comment -

          For the record, I verified that Test2BPostings passes on the 5.5 branch.

          Show
          jpountz Adrien Grand added a comment - For the record, I verified that Test2BPostings passes on the 5.5 branch.

            People

            • Assignee:
              yseeley@gmail.com Yonik Seeley
              Reporter:
              yseeley@gmail.com Yonik Seeley
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development