Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Labels: None
- Lucene Fields: New
Description
It would be useful for the different files in a Lucene index to include checksums. This would make it easy to spot corruption while copying index files around; the various cloud efforts assume many more data-copying operations than older single-index implementations.
This feature might be much easier to implement if all index files are created in a sequential fashion. This issue therefore depends on LUCENE-2373.
Attachments
- LUCENE-2446.patch (213 kB, Robert Muir)
Issue Links
- depends upon: LUCENE-2373 Create a Codec to work with streaming and append-only filesystems (Closed)
Activity
+1, this looks wonderful; it gives us end-to-end checksums, like ZFS.
This means that when there is a checksum mismatch, we can be quite certain that there's a hardware problem (bit flipper) in the user's env, and not a Lucene bug.
This is awesome - Robert, do you think it would make sense to commit this to a branch and let CI run it for a bit? That way we can clean stuff up further if needed as well. I just wonder, though, since it's a pretty big patch.
I think the patch is quite solid as is ... we should just commit & let it bake for a while on trunk? We can iterate on further improvements...
I just ran distributed beasting from luceneutil (runs all Lucene core + modules tests, across 5 machines, 28 parallel JVMs) ~150 times over, and it didn't hit any test failures.
Is the CRC32 intrinsic available and the same on all relevant platforms?
(I assume it is Intel's CRC32C version.)
If not, what happens? Does it still work? Does it need to be cross-platform-compatible?
Could this be a performance concern?
java.util.zip.CRC32 has been used by Lucene for the segments_N file for quite some time.
This just applies it to other files. And while it is nice that it is 3x faster in Java 8, it is still 1 GB/second with Java 7 on my computer.
By the way, the instruction used in Java 8 is not CRC32C; it is PCLMULQDQ. If they broke this, it would be a big bug in Java.
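For reference, here is a minimal self-contained sketch of the public java.util.zip.CRC32 API being discussed; the class name and input data are illustrative only:
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Demo {
    public static void main(String[] args) {
        byte[] data = "hello lucene".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        // getValue() returns the 32-bit checksum in the low bits of a long.
        // The result is defined by the ZIP/zlib spec, so it is identical on
        // every platform regardless of which intrinsic the JVM uses internally.
        System.out.printf("crc32=%08x%n", crc.getValue());
    }
}
{code}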
I glanced at the patch again and I think it looks pretty good. I didn't look closely enough before! +1 to move that in and let it bake.
Great!
I am not so happy about the method name validate(). Especially as it is in AtomicReader, people might think (because of the name) it's just some cheap check. Yes, I know there are Javadocs, but IDE suggestions may lead people to use it!
I have not yet thought thoroughly about a better name; maybe checkIntegrity()?
bq. Is the CRC32 intrinsic available and the same on all relevant platforms?
It does not matter what the JVM does internally. The public API for CRC32 is there in the java.util.zip package, and we have used it since the early days of Lucene. If it suddenly returned incorrect/different results, it would violate the ZIP spec (which is indirectly an ISO standard, through the OpenDocument ISO standard).
bq. I am not so happy about the method name
Why not a method name that indicates the function used? Something like checkIntegrityUsingRobertsNiftyNewCRC32Code?
Hmmm, a little long, but what about something like
validateCRC32
or
checkIntegrityCRC32
Although it's currently only doing file integrity checks, we may extend this later! Maybe remove more parts of the CheckIndex class and move them to the Codec level?
Another idea: move validate()/checkIntegrity() up to IndexReader, so it would also work on DirectoryReaders? This would help with cleaning up stuff from CheckIndex.
I don't want to do this, Uwe. I don't want all this CheckIndex stuff being run at merge. A big point here is to provide an option to prevent corruption from being propagated on merge.
If you read my description, I am unhappy about the current situation because it cannot be enabled by default for performance reasons. I want to fix the codec APIs so this is no longer the case. I don't want CheckIndex code moved into this method.
Robert, sorry if I threw some unrelated ideas in here: my idea has not much to do with the stuff here. It just came to my mind when I saw the validate() method, which made me afraid.
Now just throwing in something which does not have to affect you while implementing checksumming - just think about it as a separate issue, inspired by this one:
In general, CheckIndex as a separate class is wrong in my opinion. Especially if it does checks not valid for all codecs (using instanceof checks in CheckIndex code - horrible). A really good CheckIndex should work like fsck: implemented by the file system driver + filesystem code. So the checks currently done by CheckIndex should be done by the codecs (only general checks, like numDocs checks, may work on top).
Now back to this issue and the initial suggestion: validate => checkIntegrity. What do you think?
I think checkIntegrity is too generic (i.e. it does not check the integrity of the posting lists, à la what CheckIndex does). How about validateChecksums, since that's what it currently does? I don't mind if it's renamed in the future if the method validates more things; we shouldn't worry about back-compat with this API.
You guys figure out the name for the method, I really don't care. I will wait on the issue until you guys bikeshed it out. The only thing I care about is that it's not mixed up with other logic.
It's really important when debugging a corrupt index that you can know the file itself is corrupted (e.g. hardware) versus some bug in Lucene. It's also really important that there is at least the option (like there is in this patch) to prevent this corruption from being propagated in merge. So let's please not mix unrelated stuff in here.
Things like refactoring CheckIndex or whatever: please just open a separate issue for that.
bq. You guys figure out the name for the method, I really don't care. I will wait on the issue until you guys bikeshed it out.
Uwe, so what do you think: validateChecksums or checkIntegrity?
Let's get this thing committed so Jenkins can bless it.
checkIntegrity is fine with me. We can also rename it before releasing.
I'll mark it internal for now...
Commit 1583550 from Robert Muir in branch 'dev/trunk'
[ https://svn.apache.org/r1583550 ]
LUCENE-2446: add checksums to index files
Thanks Robert!
checkIntegrity() is in my opinion the best we can have: most generic, but not too easy to misunderstand and run after every commit. I was just afraid about the optimize()-loving people!
Uwe
One idea to make the whole thing more hidden from the user:
I am not sure we need the abstract checkIntegrity() on the public AtomicReader abstract interface, because it only applies to disk-based indexes. Would it not be enough to have it on SegmentReader? I know we might need some instanceof checks while merging, but I think we do those already. That way it would not be public and we could hide it from the user. We would only not validate when merging FilterAtomicReaders (split / sorted indexes).
Any opinion about this?
I don't think it's true that it "only applies to disk-based indexes".
FilterReader/SlowWrapper etc. pass this down to their underlying readers, so you still get the check on the underlying data, e.g. if you are using SortingMergePolicy.
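As a rough illustration (assuming the 4.x FilterAtomicReader API and the checkIntegrity() name settled on above), a filter reader holds no files itself but can pass the check down to the on-disk reader it wraps; the subclass here is hypothetical:
{code:java}
import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.FilterAtomicReader;

// Hypothetical filter reader: it owns no index files, but the integrity
// check still reaches the SegmentReader (or other reader) it wraps.
class PassThroughReader extends FilterAtomicReader {
    PassThroughReader(AtomicReader in) {
        super(in);
    }

    @Override
    public void checkIntegrity() throws IOException {
        in.checkIntegrity(); // delegate: verifies checksums of the underlying data
    }
}
{code}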
Commit 1583565 from Robert Muir in branch 'dev/trunk'
[ https://svn.apache.org/r1583565 ]
LUCENE-2446: ensure we close file if we hit exception writing codec header
Commit 1583863 from Robert Muir in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1583863 ]
LUCENE-2446: add checksums to index files
Hi Robert,
I think we should add a Lucene 4.8 TestBackwardsCompatibility index to the test resources. Should I prepare one?
We should take care of it, though; maybe open an issue so we add it once 4.8 is released?
It's part of the release workflow. Come on, this is nothing new and no different from any other format change. No need to single out this issue.
I think this is a pretty important issue: besides the case of distributed systems copying files around, we have the issue that today there is no integrity mechanism to detect hardware problems (which can cause developers to pull their hair out trying to debug corruptions), and we have some optimized components doing bulk merge which can propagate corruptions to new segments over a long time.
Also, in recent JVMs computing the checksum is fast: e.g. in Java 8 CRC32 is intrinsic and uses clmul hardware instructions on x86, and so on.
I created an initial patch: the last 8 bytes of every file are a zlib-crc32 checksum. We also write some additional metadata before it (this is done via CodecUtil.writeFooter) so we can extend it further in the future if we need to.
For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify when we open, because we are reading the file anyway. So this provides some extra safety.
For larger files this would be expensive: instead the patch adds AtomicReader.validate(), which asks the codec (or FilterReader, or whatever) to ensure everything is valid. This is called by e.g. CheckIndex before decoding.
The patch adds an option (defaults to off) on IndexWriterConfig to call this before merging. Ideally we wouldn't need this and would just validate-as-we-merge, but that requires some codec/merge API changes...
File format changes are backwards compatible.
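To make the footer concrete, here is a rough sketch of the write/verify pattern described above, assuming the CodecUtil.writeFooter/checksumEntireFile helpers roughly as committed; the file name, codec name, and contents are made up for illustration:
{code:java}
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

class FooterSketch {
    // Writing: CodecUtil.writeFooter appends the footer magic, an algorithm
    // id, and the zlib-crc32 of everything written before it.
    static void write(Directory dir) throws IOException {
        IndexOutput out = dir.createOutput("example.dat", IOContext.DEFAULT);
        try {
            CodecUtil.writeHeader(out, "ExampleCodec", 0);
            out.writeLong(42L); // ...actual file contents...
            CodecUtil.writeFooter(out);
        } finally {
            out.close();
        }
    }

    // Verifying: a large file can be checksummed in one sequential pass
    // without decoding it; a mismatch raises CorruptIndexException.
    static void verify(Directory dir) throws IOException {
        IndexInput in = dir.openInput("example.dat", IOContext.READONCE);
        try {
            CodecUtil.checksumEntireFile(in);
        } finally {
            in.close();
        }
    }
}
{code}
The merge-time option then only needs a single IndexWriterConfig switch that triggers this per-file verification before a merge reads the segment.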