Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10840

Random Index Corruption during bulk indexing

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Cannot Reproduce
    • 6.3, 6.5.1
    • None
    • update
    • None
    • AWS EC2 instance running Centos 7

    Description

      I'm seeing a randomly occuring Index Corruption exception during a Solr data ingest. This can occur anywhere during the 7-8 hours our ingests take. I'm initially submitting this as a Solr bug as this is the envioronment I'm using, but it does look as though the error is occurring in Lucene code.

      Some background:

      AWS EC2 server running CentOS 7
      java.​runtime.​version: 1.8.0_131-b11 (also occurred with 1.8.0_45).
      Solr 6.3.0 (have also seen it with Solr 6.5.1). It did not happen with Solr 5.4 9which i can't go back to). Oddly enough, I ran Solr 6.3.0 unvenetfully for several weeks before this problem first occurred.
      Standalone (non cloud) environment.

      Our indexing subsystem is a complex Python script which creates multiple indexing subprocesses in order to make use of multiple cores. Each subprocess reads records from a MySQL database, does some significant preprocessing and sends a batch of documents (defaults to 500) to the Solr update handler (using the Python 'scorched' module). Each content source (there are 5-6) requires a separate instantiation of the script, and these wrapped in a Bash script to run serially.

      When the exception occurs, we always see something like the following in the solr.log

      ERROR - 2017-06-06 14:37:34.639; [ x:stresstest1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception writing document id med-27840-00384802 to the index; possible analysis error.
      at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178
      ...
      Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
      at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
      at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
      at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
      at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:279)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
      at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
      ... 42 more
      Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")
      at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
      at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
      at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
      at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
      at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
      at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:58)
      at org.apache.lucene.codecs.lucene53.Lucene53NormsFormat.normsProducer(Lucene53NormsFormat.java:82)
      at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:113)
      at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
      at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
      at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
      at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
      at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
      at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
      at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)
      at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
      at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
      at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
      Suppressed: org.apache.lucene.index.CorruptIndexException: checksum status indeterminate: remaining=0, please run checkindex for more details (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")))
      at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
      at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:63)
      ... 12 more

      This is usually followed in very short order by similar exceptions as other UpdateHandler threads hit the same IOException.

      I've also seenwhat I assume is a related error during autocommits-

      INFO - 2017-05-30 18:01:35.264; [ x:build0530] org.apache.solr.update.DirectUpdateHandler2; start commit

      {,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}

      ERROR - 2017-05-30 18:01:36.884; [ x:build0530] org.apache.solr.common.SolrException; auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher
      at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
      at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
      at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:630)
      at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-2060254071 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0530/index/_15w_Lucene50_0.tip")))
      at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:499)
      at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:411)
      at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:520)
      at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:178)
      at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
      at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
      at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:106)
      at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
      at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
      at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
      at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
      at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
      at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
      at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
      at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:454)
      at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
      at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
      at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
      at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1731)
      ... 10 more

      Other Observations:

      It's not assocciated with a specific Lucene index file type (the 'read past EOF' has been reported on .fnm, .nvm, .tip. .si, .dvm files)
      I've configured merges to use the LogByteSizeMergePolicyFactory and TieredMergePolicyFactory, and I see the failure with either.
      The file system (~600gb) is never more than 50% full so disk space is not an issue
      I've seen this occur with indexes on both ext4 and xfs file systems (which have been fsck'ed /repaired, and we're not seeing any hardware problems reported in the system logs). These file systems are all SSDs.
      Solr is started with a 5Gb heap and I haven't seen heap usage > 3gb; also, there is no concurrent query activity during the indexing process.
      I can recover from this error by unloading the core, running fixindex (whixch reports no errors), reloading the core, and continuing indexing from a checkpoint in the indexing script.

      I've created a test rig (in Python) which can be run independently of our environment and workflow, and have managed to get this to throw the exception (the first stack trace above is from a run with that).

      My semi-informed guess is that this is due to a race condition between segment merges and index updates...

      Attachments

        Activity

          People

            Unassigned Unassigned
            simon.rosenthal Simon Rosenthal
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: