Lucene - Core: LUCENE-6192

Long overflow in LuceneXXSkipWriter can corrupt skip data

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.5, 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I've been iterating with Tom on this corruption that CheckIndex detects in his rather large index (720 GB in a single segment):

       java -Xmx16G -Xms16G -cp $JAR -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /XXXX/shards/4/core-1/data/test_index -verbose 2>&1 |tee -a shard4_reoptimizedNewJava
      
      
      Opening index @ /htsolr/lss-reindex/shards/4/core-1/data/test_index
      
      Segments file=segments_e numSegments=1 version=4.10.2 format= userData={commitTimeMSec=1421479358825}
        1 of 1: name=_8m8 docCount=1130856
          version=4.10.2
          codec=Lucene410
          compound=false
          numFiles=10
          size (MB)=719,967.32
          diagnostics = {timestamp=1421437320935, os=Linux, os.version=2.6.18-400.1.1.el5, mergeFactor=2, source=merge, lucene.version=4.10.2, os.arch=amd64, mergeMaxNumSegments=1, java.version=1.7.0_71, java.vendor=Oracle Corporation}
          no deletions
          test: open reader.........OK
          test: check integrity.....OK
          test: check live docs.....OK
          test: fields..............OK [80 fields]
          test: field norms.........OK [23 fields]
          test: terms, freq, prox...ERROR: java.lang.AssertionError: -96
      java.lang.AssertionError: -96
              at org.apache.lucene.codecs.lucene41.ForUtil.skipBlock(ForUtil.java:228)
              at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsAndPositionsEnum.skipPositions(Lucene41PostingsReader.java:925)
              at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsAndPositionsEnum.nextPosition(Lucene41PostingsReader.java:955)
              at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1100)
              at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1357)
              at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:655)
              at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2096)
          test: stored fields.......OK [67472796 total field count; avg 59.665 fields per doc]
          test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
          test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
      FAILED
          WARNING: fixIndex() would remove reference to this segment; full exception:
      java.lang.RuntimeException: Term Index test failed
              at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:670)
              at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2096)
      
      WARNING: 1 broken segments (containing 1130856 documents) detected
      WARNING: would write new segments file, and 1130856 documents would be lost, if -fix were specified
      

      And Rob spotted long -> int casts in our skip list writers that look like they could cause such corruption if a single high-freq term with many positions required > 2.1 GB to write its positions into .pos.
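The narrowing cast described above can be reproduced in isolation. The following is a minimal sketch, not the actual skip writer code: the class and method names and the 3 GB figure are assumptions chosen for illustration. It shows what happens to a file-pointer delta once it exceeds Integer.MAX_VALUE (about 2.1 GB).

```java
// Illustrative sketch only; this is not the Lucene41SkipWriter source.
final class SkipDeltaOverflow {

    /** Mimics the buggy pattern: a long file-pointer delta narrowed to int. */
    static int narrowedDelta(long curPointer, long lastPointer) {
        return (int) (curPointer - lastPointer);
    }

    /** Keeps the full 64-bit delta, which is what a long-based write needs. */
    static long wideDelta(long curPointer, long lastPointer) {
        return curPointer - lastPointer;
    }

    public static void main(String[] args) {
        long last = 0L;
        long cur = 3_000_000_000L; // hypothetical: about 3 GB of position data

        // The int cast wraps around and yields a negative delta.
        System.out.println("int cast:  " + narrowedDelta(cur, last)); // -1294967296
        System.out.println("long kept: " + wideDelta(cur, last));     // 3000000000
    }
}
```

A delta that wraps like this would be stored as a wrong value in the skip data, so a reader following that skip entry would land at the wrong offset in .pos.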

      1. LUCENE-6192.patch
        2 kB
        Michael McCandless

        Activity

        Robert Muir added a comment - edited

        The bug is actually old (even in 3.x), but in 4.x we added "skipMultiplier", which means we write less skip data, and that uncovers it. It just takes a > 2.1GB delta at a higher level to screw things up.

        Tom's blog post here mentions > 2GB .pos data for "the".

        edit: add url http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
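To make the "delta at a higher level" point concrete, here is a rough sketch of how the span covered by one skip entry grows per level in a multi-level skip list. The skipInterval of 128, the skipMultiplier of 8, and the average of 5000 position bytes per document are all assumptions picked for illustration, not values taken from this issue.

```java
// Back-of-the-envelope sketch; all constants used in main are assumptions.
final class SkipLevelSpan {

    /** Documents covered by one skip entry at the given level (level 0 is the lowest). */
    static long docsPerEntry(int skipInterval, int skipMultiplier, int level) {
        long span = skipInterval;
        for (int i = 0; i < level; i++) {
            span *= skipMultiplier;
        }
        return span;
    }

    /** First level whose per-entry byte delta no longer fits in an int, or -1 if none. */
    static int firstOverflowingLevel(int skipInterval, int skipMultiplier,
                                     int maxLevels, long avgPosBytesPerDoc) {
        for (int level = 0; level < maxLevels; level++) {
            long deltaBytes = docsPerEntry(skipInterval, skipMultiplier, level) * avgPosBytesPerDoc;
            if (deltaBytes > Integer.MAX_VALUE) {
                return level;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int skipInterval = 128;   // assumed block size
        int skipMultiplier = 8;   // assumed multiplier
        long avgBytes = 5_000L;   // hypothetical .pos bytes per doc for one very frequent term

        for (int level = 0; level < 6; level++) {
            long docs = docsPerEntry(skipInterval, skipMultiplier, level);
            System.out.println("level " + level + ": " + docs + " docs, "
                    + (docs * avgBytes) + " bytes per entry");
        }
        System.out.println("first overflowing level: "
                + firstOverflowingLevel(skipInterval, skipMultiplier, 10, avgBytes));
    }
}
```

Under these assumed numbers the lower levels stay far below 2.1 GB per entry, and only the sparse upper levels cross the int limit, which would fit the observation that the problem shows up only for a very frequent term in a very large segment.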

        Michael McCandless added a comment -

        Here's a patch, only for 4.10.x for starters (Lucene41SkipWriter) so Tom can test it and see if it fixes his exception. The good thing about skip data is it's ignored during merging, so to test this you just need to apply the patch, compile & deploy Lucene core JAR, then optimize so the skip data is regenerated...

        On 5.x/trunk we must also fix Lucene50SkipWriter...

        Ryan Ernst added a comment -

        +1 to the patch.

        And just to confirm, we shouldn't need a format change, since this is a bug in writing, and the reader isn't really changing (vlong is a superset of vint)?

        Robert Muir added a comment -

        +1. Ryan: yes, that's the case.
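The "vlong is a superset of vint" remark can be checked with a standalone re-implementation of the 7-bits-per-byte variable-length encoding. This sketch does not call Lucene; the class and method names are assumptions, and the encoding loops are written from the general scheme rather than copied from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Standalone sketch of the variable-length encoding; does not depend on Lucene.
final class VarIntCompat {

    /** Encodes an int using 7 data bits per byte, with the high bit as a continuation flag. */
    static byte[] encodeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
        return out.toByteArray();
    }

    /** Same scheme for a non-negative long. */
    static byte[] encodeVLong(long l) {
        if (l < 0) {
            throw new IllegalArgumentException("negative values are not supported: " + l);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((l & ~0x7FL) != 0L) {
            out.write((int) ((l & 0x7FL) | 0x80L));
            l >>>= 7;
        }
        out.write((int) l);
        return out.toByteArray();
    }

    /** Decodes bytes produced by either encoder above. */
    static long decodeVLong(byte[] bytes) {
        long value = 0L;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // For a non-negative value that fits in an int, both encoders emit identical bytes.
        System.out.println(Arrays.equals(encodeVInt(300), encodeVLong(300L))); // true

        // A delta above Integer.MAX_VALUE round-trips only through the long encoding.
        System.out.println(decodeVLong(encodeVLong(3_000_000_000L))); // 3000000000
    }
}
```

Because the byte layout is identical for values that fit in an int, data written earlier as a vint remains readable by a vlong reader, which is consistent with no format change being needed.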

        ASF subversion and git services added a comment -

        Commit 1653577 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1653577 ]

        LUCENE-6192: don't overflow int when writing skip data for high freq terms in extremely large indices

        ASF subversion and git services added a comment -

        Commit 1653580 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1653580 ]

        LUCENE-6192: don't overflow int when writing skip data for high freq terms in extremely large indices

        ASF subversion and git services added a comment -

        Commit 1653585 from Michael McCandless in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1653585 ]

        LUCENE-6192: don't overflow int when writing skip data for high freq terms in extremely large indices

        ASF subversion and git services added a comment -

        Commit 1653588 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1653588 ]

        LUCENE-6192: don't overflow int when writing skip data for high freq terms in extremely large indices

        ASF subversion and git services added a comment -

        Commit 1653593 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1653593 ]

        LUCENE-6192: don't overflow int when writing skip data for high freq terms in extremely large indices

        ASF subversion and git services added a comment -

        Commit 1653594 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1653594 ]

        LUCENE-6192: don't overflow int when writing skip data for high freq terms in extremely large indices

        Michael McCandless added a comment -

        Resolving ... Tom can you post back here the results of testing with this fix? Thanks. Hopefully this is the bug you were hitting!

        ASF subversion and git services added a comment -

        Commit 1653606 from Michael McCandless in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1653606 ]

        LUCENE-6192: don't overflow int when writing skip data for high freq terms in extremely large indices

        Tom Burton-West added a comment -

        I'll report as soon as I have some results. Still have about 10% (about 1.3 million books or slightly less than a terabyte of OCR) to index. Once that is done we will deploy a Solr war with the patch and optimize. That will take overnight. When the optimize is done we will then run CheckIndex. So hopefully by Friday I will have something to report.

        Tom Burton-West added a comment -

        Patch works! Thanks Mike!

        Deployed Solr war with the patch and ran optimize on 12 shards. All CheckIndexes passed.
        Below are some of the stats on one of the shards.

        Tom

        About 1 million docs and 700GB index with about 4 billion unique terms, 270 billion tokens

        docCount=1086381
        size (MB)=693,308.47

        test: terms, freq, prox...OK [4113882974 terms; 61631126560 terms/docs pairs; 270670957886 tokens]

        field "ocr":
          index FST:
            27157406 nodes
            77300582 arcs
            1262090664 bytes
          terms:
            4087713620 terms
            50227574334 bytes (12.3 bytes/term)
          blocks:
            132202225 blocks
            96419097 terms-only blocks
            40757 sub-block-only blocks
            35742371 mixed blocks
            27202047 floor blocks
            44718055 non-floor blocks
            87484170 floor sub-blocks
            23560113026 term suffix bytes (178.2 suffix-bytes/block)
            8227225977 term stats bytes (62.2 stats-bytes/block)
            19664735257 other bytes (148.7 other-bytes/block)
            by prefix length:

        Michael McCandless added a comment -

        OK that's great news; thanks for bringing closure Tom!

        ASF subversion and git services added a comment -

        Commit 1655678 from Michael McCandless in branch 'dev/branches/lucene_solr_5_0'
        [ https://svn.apache.org/r1655678 ]

        LUCENE-6192: add CHANGES entry

        ASF subversion and git services added a comment -

        Commit 1655681 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1655681 ]

        LUCENE-6192: add CHANGES entry

        ASF subversion and git services added a comment -

        Commit 1655682 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1655682 ]

        LUCENE-6192: add CHANGES entry

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1
            Watchers:
            5