OAK-4114: Cached lucene index gets corrupted in case of unclean shutdown and journal rollback in SegmentNodeStore


Details

    Description

      Currently the Oak Lucene support copies index files to the local file system as part of the CopyOnRead feature. In one of the setups it has been observed that the index logic was failing with the following error:

      04.02.2016 17:47:52.391 *WARN* [oak-lucene-3] org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier [/oak:index/lucene] Found local copy for _2ala.cfs in MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 but size of local 9320 differs from remote 3714150. Content would be read from remote file only
      04.02.2016 17:47:52.399 *WARN* [oak-lucene-3] org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier [/oak:index/lucene] Found local copy for segments_28je in MMapDirectory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 lockFactory=NativeFSLockFactory@/mnt/crx/author/crx-quickstart/repository/index/e5a943cdec3000bd8ce54924fd2070ab5d1d35b9ecf530963a3583d43bf28293/1 but size of local 1214 differs from remote 1175. Content would be read from remote file only
      04.02.2016 17:47:52.491 *ERROR* [oak-lucene-3] org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker Failed to open Lucene index at /oak:index/lucene
      org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=1953790076 vs expected header=1071082519 (resource: SlicedIndexInput(SlicedIndexInput(_2ala.fnm in _2ala.cfs) in _2ala.cfs slice=8810:9320))
      	at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128)
      	at org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:56)
      	at org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:215)
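
      The warnings above come from comparing the size of the locally copied file against the copy stored in the repository. A minimal sketch of that kind of check, using only plain JDK classes (illustrative only, not the actual IndexCopier code; the class and method names are assumptions):

      import java.io.File;

      public class LocalCopyCheck {

          enum Source { LOCAL, REMOTE }

          // Decide whether a locally copied index file can be served, given the
          // file length recorded in the repository for the same file.
          static Source chooseSource(File localCopy, long remoteLength) {
              if (!localCopy.exists()) {
                  return Source.REMOTE;
              }
              if (localCopy.length() != remoteLength) {
                  // Same situation as the WARN messages above: a local copy exists
                  // but its size differs, so reads fall back to the remote file.
                  System.err.printf("Found local copy for %s but size of local %d differs "
                          + "from remote %d. Content would be read from remote file only%n",
                          localCopy.getName(), localCopy.length(), remoteLength);
                  return Source.REMOTE;
              }
              // Size matches, so the local copy is trusted, even though (as this
              // issue shows) the content can still differ.
              return Source.LOCAL;
          }
      }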
      

      In the log above the size of _2ala.cfs differed from the remote copy, and possibly other index files had the same size but different content. Comparing the modified times of the local files with those in Oak, it can be seen that one of the file-system copies was older than the one in Oak:

      _2alr.cfs={name=_2alr.cfs, size=1152402, sizeStr=1.2 MB, modified=Thu Feb 04 17:52:31 GMT 2016, osModified=Feb 4 17:52, osSize=1152402, mismatch=false}
      _2ala.cfe={name=_2ala.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=224, mismatch=false}
      _2ala.si={name=_2ala.si, size=252, sizeStr=252 B, modified=Thu Feb 04 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=252, mismatch=false}
      _2ala.cfs={name=_2ala.cfs, size=3714150, sizeStr=3.7 MB, modified=Thu Feb 04 17:47:28 GMT 2016, osModified=Feb 4 17:17, osSize=9320, mismatch=true}
      _14u3_29.del={name=_14u3_29.del, size=1244036, sizeStr=1.2 MB, modified=Thu Feb 04 16:37:35 GMT 2016, osModified=Feb 4 16:37, osSize=1244036, mismatch=false}
      _2akw.si={name=_2akw.si, size=252, sizeStr=252 B, modified=Thu Feb 04 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=252, mismatch=false}
      _2akw.cfe={name=_2akw.cfe, size=224, sizeStr=224 B, modified=Thu Feb 04 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=224, mismatch=false}
      _2akw.cfs={name=_2akw.cfs, size=4952761, sizeStr=5.0 MB, modified=Thu Feb 04 16:37:07 GMT 2016, osModified=Feb 4 16:37, osSize=4952761, mismatch=false}
      

      On the same setup the system also saw a rollback in the segment node store:

      -rw-rw-r--. 1 crx crx  25961984 Feb  4 16:47 data01357a.tar
      -rw-rw-r--. 1 crx crx  24385536 Feb  4 16:41 data01357a.tar.bak
      -rw-rw-r--. 1 crx crx    359936 Feb  4 17:18 data01358a.tar
      -rw-rw-r--. 1 crx crx    345088 Feb  4 17:17 data01358a.tar.bak
      -rw-rw-r--. 1 crx crx  70582272 Feb  4 18:35 data01359a.tar
      -rw-rw-r--. 1 crx crx  66359296 Feb  4 18:33 data01359a.tar.bak
      -rw-rw-r--. 1 crx crx    282112 Feb  4 18:46 data01360a.tar
      -rw-rw-r--. 1 crx crx    236544 Feb  4 18:45 data01360a.tar.bak
      -rw-rw-r--. 1 crx crx    138240 Feb  4 18:56 data01361a.tar
      

      So one possible sequence of events is:

      1. At some time before 17:17 the Lucene index got updated and _2ala.cfs was created.
      2. After the update the head revision in the segment store was updated, but the new revision had not yet made it to the journal log.
      3. The Lucene CopyOnRead logic got an event for the change and copied the file.
      4. The system crashed and hence the journal did not get updated.
      5. The system restarted; going by the last entry in the journal the system suffered some "data loss", and hence the index checkpoint also moved back.
      6. As the checkpoint got reverted, indexing resumed from an earlier state and hence created a new file with the same name _2ala.cfs.
      7. CopyOnRead detected the change in file length and logged a warning, routing reads to the remote copy.
      8. However, other files like _2ala.si and _2ala.cfe, which were created in the same commit, had the same size but likely different content, which later caused Lucene queries to start failing (illustrated in the sketch below).
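
      To make step 8 concrete: a length-only comparison cannot tell apart two files of the same size, while a content checksum can. A small self-contained example using only the JDK's CRC32 (not Oak or Lucene code; the method names are illustrative):

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.zip.CRC32;

      public class SameSizeDifferentContent {

          // CRC32 of the whole file content; reading the file fully into memory
          // is fine for a sketch.
          static long crc32(Path file) throws IOException {
              CRC32 crc = new CRC32();
              crc.update(Files.readAllBytes(file));
              return crc.getValue();
          }

          // Returns true when two files would fool a size-only check: same length
          // but different bytes, as with the stale _2ala.si / _2ala.cfe copies.
          static boolean undetectedBySizeCheck(Path local, Path remote) throws IOException {
              boolean sameSize = Files.size(local) == Files.size(remote);
              boolean sameContent = crc32(local) == crc32(remote);
              return sameSize && !sameContent;
          }
      }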

      In such a case, a restart after cleaning the existing local index content would have brought the system back to a normal state.

      So as a fix we would need to come up with some sanity check at the time of system startup.
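
      One possible shape for such a startup check is sketched below. It assumes the expected sizes of the index files can be obtained from the index state in the repository; the expectedSizes map stands in for that and is not an existing Oak API. Any local copy whose size is unknown or does not match is deleted so that it gets copied afresh instead of shadowing the repository content.

      import java.io.IOException;
      import java.nio.file.DirectoryStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.Map;

      public class StartupIndexSanityCheck {

          // On startup, walk the locally cached index directory and delete every
          // file whose size does not match the size recorded in the repository,
          // so that it is copied again instead of masking the remote content.
          static void purgeStaleLocalCopies(Path localIndexDir, Map<String, Long> expectedSizes)
                  throws IOException {
              try (DirectoryStream<Path> files = Files.newDirectoryStream(localIndexDir)) {
                  for (Path file : files) {
                      Long expected = expectedSizes.get(file.getFileName().toString());
                      if (expected == null || Files.size(file) != expected) {
                          // Unknown or mismatched size: drop the local copy. A stricter
                          // check could also compare a content checksum, since size alone
                          // missed the corrupted _2ala.si / _2ala.cfe copies above.
                          Files.delete(file);
                      }
                  }
              }
          }
      }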


            People

              Assignee: Chetan Mehrotra
              Reporter: Chetan Mehrotra
              Votes: 1
              Watchers: 3
