[SOLR-10840] Random Index Corruption during bulk indexing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 6.3, 6.5.1
Fix Version/s: None
Component/s: update
Labels:
None
Environment:

AWS EC2 instance running Centos 7

Description

I'm seeing a randomly occuring Index Corruption exception during a Solr data ingest. This can occur anywhere during the 7-8 hours our ingests take. I'm initially submitting this as a Solr bug as this is the envioronment I'm using, but it does look as though the error is occurring in Lucene code.

Some background:

AWS EC2 server running CentOS 7
java.runtime.version: 1.8.0_131-b11 (also occurred with 1.8.0_45).
Solr 6.3.0 (have also seen it with Solr 6.5.1). It did not happen with Solr 5.4 9which i can't go back to). Oddly enough, I ran Solr 6.3.0 unvenetfully for several weeks before this problem first occurred.
Standalone (non cloud) environment.

Our indexing subsystem is a complex Python script which creates multiple indexing subprocesses in order to make use of multiple cores. Each subprocess reads records from a MySQL database, does some significant preprocessing and sends a batch of documents (defaults to 500) to the Solr update handler (using the Python 'scorched' module). Each content source (there are 5-6) requires a separate instantiation of the script, and these wrapped in a Bash script to run serially.

When the exception occurs, we always see something like the following in the solr.log

ERROR - 2017-06-06 14:37:34.639; [ x:stresstest1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception writing document id med-27840-00384802 to the index; possible analysis error.
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:178
...
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:740)
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:754)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1558)
at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:279)
at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:166)
... 42 more
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")
at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:75)
at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:58)
at org.apache.lucene.codecs.lucene53.Lucene53NormsFormat.normsProducer(Lucene53NormsFormat.java:82)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:113)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4068)
at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4026)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3880)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
Suppressed: org.apache.lucene.index.CorruptIndexException: checksum status indeterminate: remaining=0, please run checkindex for more details (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/stresstest1/index/_441.nvm")))
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:451)
at org.apache.lucene.codecs.lucene53.Lucene53NormsProducer.<init>(Lucene53NormsProducer.java:63)
... 12 more

This is usually followed in very short order by similar exceptions as other UpdateHandler threads hit the same IOException.

I've also seenwhat I assume is a related error during autocommits-

INFO - 2017-05-30 18:01:35.264; [ x:build0530] org.apache.solr.update.DirectUpdateHandler2; start commit

{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}

ERROR - 2017-05-30 18:01:36.884; [ x:build0530] org.apache.solr.common.SolrException; auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1820)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1931)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:630)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:217)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-2060254071 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/indexes/solrindexes/build0530/index/_15w_Lucene50_0.tip")))
at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:499)
at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:411)
at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:520)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:178)
at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:445)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:292)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:372)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:106)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:74)
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
at org.apache.lucene.index.BufferedUpdatesStream$SegmentState.<init>(BufferedUpdatesStream.java:384)
at org.apache.lucene.index.BufferedUpdatesStream.openSegmentStates(BufferedUpdatesStream.java:416)
at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:261)
at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3413)
at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3399)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:454)
at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:276)
at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:235)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1731)
... 10 more

Other Observations:

It's not assocciated with a specific Lucene index file type (the 'read past EOF' has been reported on .fnm, .nvm, .tip. .si, .dvm files)
I've configured merges to use the LogByteSizeMergePolicyFactory and TieredMergePolicyFactory, and I see the failure with either.
The file system (~600gb) is never more than 50% full so disk space is not an issue
I've seen this occur with indexes on both ext4 and xfs file systems (which have been fsck'ed /repaired, and we're not seeing any hardware problems reported in the system logs). These file systems are all SSDs.
Solr is started with a 5Gb heap and I haven't seen heap usage > 3gb; also, there is no concurrent query activity during the indexing process.
I can recover from this error by unloading the core, running fixindex (whixch reports no errors), reloading the core, and continuing indexing from a checkpoint in the indexing script.

I've created a test rig (in Python) which can be run independently of our environment and workflow, and have managed to get this to throw the exception (the first stack trace above is from a run with that).

My semi-informed guess is that this is due to a race condition between segment merges and index updates...

Random Index Corruption during bulk indexing

Details

Description

Attachments

Activity

People

Dates