  1. Lucene - Core
  2. LUCENE-2729

Index corruption after 'read past EOF' under heavy update load and snapshot export

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 3.0.1, 3.0.2
    • Fix Version/s: None
    • Component/s: core/index
    • Labels: None
    • Environment: Happens on both OS X 10.6 and Windows 2008 Server. Integrated with zoie (using a zoie snapshot from 2010-08-06: zoie-2.0.0-snapshot-20100806.jar).

    • Lucene Fields: New

    Description

      We have a system running lucene and zoie. We use lucene as a content store for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled backups of the index. This works fine for small indexes and when there are not a lot of changes to the index when the backup is made.

      On large indexes (about 5 GB to 19 GB), when a backup is made while the index is being changed a lot (lots of document additions and/or deletions), we almost always get a 'read past EOF' at some point, followed by lots of 'Lock obtain timed out'.
      At that point we get lots of 0 kb files in the index, data gets lost, and the index is unusable.

      When we stop our server, remove the 0 kb files and restart our server, the index is operational again, but data has been lost.
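
The zero-length files described above can be listed before anything is deleted. The following is a minimal read-only sketch using only the JDK; it is not part of Lucene or zoie. The class name `ZeroLengthFileFinder` is hypothetical, and skipping `write.lock` is an assumption: a file-based lock factory can leave an empty `write.lock` in a healthy index.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: reports zero-length files in an index directory without modifying it. */
final class ZeroLengthFileFinder {

    /** Returns the zero-length regular files directly inside indexDir, sorted by path. */
    static List<Path> findZeroLengthFiles(Path indexDir) throws IOException {
        List<Path> empty = new ArrayList<>();
        // Not recursive: the index directory is assumed to be flat.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(indexDir)) {
            for (Path p : entries) {
                // Assumption: an empty write.lock is a lock marker, not a damaged index file.
                boolean isLockFile = p.getFileName().toString().equals("write.lock");
                if (!isLockFile && Files.isRegularFile(p) && Files.size(p) == 0L) {
                    empty.add(p);
                }
            }
        }
        Collections.sort(empty);
        return empty;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ZeroLengthFileFinder <path-to-index-directory>
        for (Path p : findZeroLengthFiles(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Running this with the server stopped shows which files would be removed by the manual cleanup; it does not tell whether the remaining segments are consistent.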

      I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. Hopefully someone has some ideas where to look to fix this.

      Some more details...

      Stack trace of the read past EOF and following Lock obtain timed out:

      78307 [proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@31ca5085] 
          ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
      java.io.IOException: read past EOF
          at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
          at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
          at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
          at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
          at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
          at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166)
          at org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
          at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
          at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
          at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
          at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
          at proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
          at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
          at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
          at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
          at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
          at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
      579336 [proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@31ca5085] 
          ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
          Problem copying segments: Lock obtain timed out: 
          org.apache.lucene.store.SingleInstanceLock@5ad0b895: write.lock
      org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
          org.apache.lucene.store.SingleInstanceLock@5ad0b895: write.lock
          at org.apache.lucene.store.Lock.obtain(Lock.java:84)
          at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
          at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:957)
          at proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
          at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
          at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
          at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
          at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
          at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
      

      We get exactly the same behaviour on both OS X and on Windows. On both zoie is using a SimpleFSDirectory.
      We also use a SingleInstanceLockFactory (since our process is the only one working with the index), but we get the same behaviour with a NativeFSLock.

      The snapshot backup is being made by calling:

      proj.zoie.impl.indexing.ZoieSystem.exportSnapshot(WritableByteChannel)
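
For illustration, a backup call of this shape can be driven as sketched below. This is a minimal sketch under stated assumptions, not our actual backup code: `SnapshotSource` is a hypothetical stand-in for the object that exposes `exportSnapshot(WritableByteChannel)` (its return type and declared exceptions are assumed here), and `SnapshotBackup` and `backupTo` are made-up names. The snapshot is written to a temporary file and only renamed into place when the export completes, so a failed export does not leave a partial backup behind.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical wrapper that streams a snapshot export into a backup file. */
final class SnapshotBackup {

    /** Hypothetical stand-in for the system exposing exportSnapshot(WritableByteChannel). */
    interface SnapshotSource {
        void exportSnapshot(WritableByteChannel channel) throws IOException;
    }

    /** Exports to "<target>.tmp", then renames to target; returns the backup size in bytes. */
    static long backupTo(SnapshotSource source, Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileChannel out = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            source.exportSnapshot(out);
            out.force(true); // flush file content and metadata before the rename
        } catch (IOException | RuntimeException e) {
            // The channel is already closed here, so the partial file can be removed.
            Files.deleteIfExists(tmp);
            throw e;
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        return Files.size(target);
    }
}
```

With a real system the call would look like `SnapshotBackup.backupTo(zoie::exportSnapshot, path)`, provided the actual method signature is compatible with the assumed interface.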

      Same issue in zoie JIRA:

      http://snaprojects.jira.com/browse/ZOIE-51

      Attachments

        1. read-past-eof-debugging.zip
          1.17 MB
          Jan te Beest
        2. LUCENE-2729-test1.patch
          0.8 kB
          Michael McCandless
        3. eof-extra-logging-4-analysis.txt
          9 kB
          Nico Krijnen
        4. eof-extra-logging-4.log.zip
          3.00 MB
          Nico Krijnen
        5. backup_force_failure2.log.zip
          4.44 MB
          Nico Krijnen
        6. 2010-11-02 IndexWriter infoStream log.zip
          1.13 MB
          Nico Krijnen

          People

            Assignee: Unassigned
            Reporter: Nico Krijnen (nkrijnen)