Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Invalid
Lucene Fields: New
Description
Recently, I've experienced sporadic corruptions when trying to index Wikipedia in the benchmark. I know mikemccand hit similar problems in the nightly benchmark, and he also has an older CPU (see below for more on this).
I am using this Python script (courtesy of Mike) to index Wikipedia in a loop, tweaked for lots of threads and heavy merging so it fails faster: http://pastebin.com/jwpdELDe
I get corruptions constantly, though sometimes it takes a few iterations.
The errors look like this, where the bytes we write "seem to be fine" but the CRC32 itself may be computed incorrectly at write time:
Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=e2b2d8f5 actual=a04da0c (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/data/corrumption_playground/index/_1p_Lucene50_0.tim")))
at perf.IndexThreads$IndexThread.run(IndexThreads.java:402)
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=e2b2d8f5 actual=a04da0c (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/data/corrumption_playground/index/_1p_Lucene50_0.tim")))
This happens with different file extensions (.tip, .tim, .pos, .doc, .dvd, ...). Whenever one of these corrupted files was included in a commit point, I ran "the rest of CheckIndex" minus the CRC32 check, and it always passed. That is no guarantee, though, that a bad checksum over good bytes is what is actually happening.
I think the bug may, for some reason, be easier to reproduce on my CPU, maybe because it's older and only has AVX1, or for some other reason:
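One way to check independently whether the stored checksum or the file bytes are wrong is to recompute the CRC32 outside the JVM. The following is a minimal sketch, not the Lucene implementation. It assumes the file ends with an 8-byte big-endian value holding the CRC32 of all preceding bytes, which is how I understand the Lucene 5.x footer. The function names are made up for illustration.

```python
import struct
import zlib


def recompute_crc32(path, trailer_len=8, chunk_size=1 << 20):
    """CRC32 of every byte in the file except the final trailer_len bytes."""
    crc = 0
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to end to learn the file size
        remaining = f.tell() - trailer_len
        if remaining < 0:
            raise ValueError("file is shorter than the assumed trailer")
        f.seek(0)
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # running CRC over the prefix
            remaining -= len(chunk)
    return crc & 0xFFFFFFFF


def stored_crc32(path, trailer_len=8):
    """Read the trailing checksum (ASSUMED layout: 8-byte big-endian long)."""
    with open(path, "rb") as f:
        f.seek(-trailer_len, 2)
        return struct.unpack(">Q", f.read(trailer_len))[0]


def check_file(path):
    """Return (matches, expected, actual) for one index file."""
    expected = stored_crc32(path)
    actual = recompute_crc32(path)
    return expected == actual, expected, actual
```

Running `check_file` several times on the same file, and on a copy made on a different machine, might help separate a wrong stored checksum from a flaky read or a flaky CRC computation.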
model : 42
model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
Other notes:
- does not need multiple threads. I did this to make the "test" fail faster. It will fail sometimes with maxBufferedDocs + SerialMergeScheduler + 1 thread, which is deterministic.
- have not tested JDK9 in any way, might be some already-fixed bug.
- I've run numerous hardware tests: memory, disk, etc.
- I've run the tests with two different SSD drives: both fail.
First step: clean up this script and make it so it can be reproduced on other hardware. I can try on my laptop as well.
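As part of making the reproduction easier to compare across machines, the failures could be tallied per file extension from the captured output. This is only a sketch: the regular expression is written against the exception message quoted above, and the function name `tally_failures` is hypothetical.

```python
import os
import re
from collections import Counter

# Matches the "checksum failed" message format shown in the stack trace above.
CHECKSUM_RE = re.compile(
    r'checksum failed.*?expected=([0-9a-f]+) actual=([0-9a-f]+).*?path="([^"]+)"'
)


def tally_failures(log_lines):
    """Count distinct checksum failures per file extension."""
    seen = set()
    by_ext = Counter()
    for line in log_lines:
        m = CHECKSUM_RE.search(line)
        if not m:
            continue
        expected, actual, path = m.groups()
        key = (path, expected, actual)
        if key in seen:
            # The same failure is repeated on the "Caused by:" line.
            continue
        seen.add(key)
        by_ext[os.path.splitext(path)[1]] += 1
    return by_ext
```

Feeding it the combined stderr of every iteration gives a per-extension count that can be compared between the desktop and the laptop runs.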