Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Labels: None
- Lucene Fields: New
Description
It would be useful for the different files in a Lucene index to include checksums. This would make it easy to spot corruption while copying index files around; the various cloud efforts assume many more data-copying operations than older single-index implementations.
This feature might be much easier to implement if all index files are created in a sequential fashion. This issue therefore depends on LUCENE-2373.
Attachments
- LUCENE-2446.patch (213 kB, Robert Muir)
Issue Links
- depends upon: LUCENE-2373 Create a Codec to work with streaming and append-only filesystems (Closed)
Activity
+1, this looks wonderful; it gives us end-to-end checksums, like ZFS.
This means that when there is a checksum mismatch, we can be quite certain that there's a hardware problem (bit flipper) in the user's env, and not a Lucene bug.
This is awesome - Robert, do you think it would make sense to commit this to a branch and let CI run it for a bit? That way we can clean stuff up further if needed as well. I just wonder, though, since it's a pretty big patch.
I think the patch is quite solid as is ... we should just commit & let it bake for a while on trunk? We can iterate on further improvements...
I just ran distributed beasting from luceneutil (runs all Lucene core + modules tests, across 5 machines, 28 parallel JVMs) ~150 times over, and it didn't hit any test failures.
Is the CRC32 intrinsic available and the same on all relevant platforms?
(I assume it is Intel's CRC32C version.)
If not, what happens? Does it still work? Does it need to be cross-platform-compatible?
Could this be a performance concern?
java.util.zip.CRC32 has been used by Lucene for the segments_N file for quite some time.
This just applies it to other files. And while it is nice that it is 3x faster in Java 8, it is still 1 GB/second with Java 7 on my computer.
By the way, the instruction used in Java 8 is not CRC32C; it is PCLMULQDQ. If they broke this, it would be a big bug in Java.
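For reference, here is a minimal self-contained sketch of the public java.util.zip.CRC32 API being discussed; the class name and input data are illustrative only:
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Demo {
    public static void main(String[] args) {
        byte[] data = "hello lucene".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        // getValue() returns the 32-bit checksum in the low bits of a long.
        // The result is defined by the ZIP/zlib spec, so it is identical on
        // every platform regardless of which intrinsic the JVM uses internally.
        System.out.printf("crc32=%08x%n", crc.getValue());
    }
}
{code}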
I glanced at the patch again and I think it looks pretty good. I didn't look closely enough before! +1 to move that in and let it bake.
Great!
I am not so happy about the method name validate(). Especially as it is in AtomicReader, people might think (because of the name) it's just some cheap check. Yes, I know there are Javadocs, but IDE suggestions may lead people to use it!
I have not yet thought thoroughly about a better name; maybe checkIntegrity()?
bq. Is the CRC32 intrinsic available and the same on all relevant platforms?
It does not matter what the JVM does internally. The public API for CRC32 is there in the java.util.zip package, and we have used it since the early days of Lucene. If it suddenly returned incorrect/different results, it would violate the ZIP spec (which is indirectly an ISO standard, through the OpenDocument ISO standard).
bq. I am not so happy about the method name
Why not a method name that indicates the function used? Something like checkIntegrityUsingRobertsNiftyNewCRC32Code?
Hmmm, a little long, but what about something like
validateCRC32
or
checkIntegrityCRC32
Although it's currently only doing file integrity checks, we may extend this later! Maybe remove more parts of the CheckIndex class and move them to the Codec level?
Another idea: move validate()/checkIntegrity() up to IndexReader, so it would also work on DirectoryReaders? This would help with cleaning up stuff from CheckIndex.
I don't want to do this, Uwe. I don't want all this CheckIndex stuff being run at merge. A big point here is to provide an option to prevent corruption from being propagated on merge.
If you read my description, I am unhappy about the current situation because it cannot be enabled by default for performance reasons. I want to fix the codec APIs so this is no longer the case. I don't want CheckIndex code moved into this method.
Robert, sorry if I threw some unrelated ideas in here: my idea has not much to do with the stuff here. It just came to my mind when I saw the validate() method, which made me afraid.
Now just throwing in something which does not have to affect you while implementing checksumming - just think about it as a separate issue, inspired by this one:
In general, CheckIndex as a separate class is wrong in my opinion. Especially if it does checks not valid for all codecs (using instanceof checks in CheckIndex code - horrible). A really good CheckIndex should work like fsck: implemented by the file system driver + filesystem code. So the checks currently done by CheckIndex should be done by the codecs (only general checks, like numDocs checks, may work on top).
Now back to this issue and the initial suggestion: validate => checkIntegrity. What do you think?
I think checkIntegrity is too generic (i.e. it does not check the integrity of the posting lists, à la what CheckIndex does). How about validateChecksums, since that's what it currently does? I don't mind if it's renamed in the future if the method validates more things; we shouldn't worry about back-compat with this API.
You guys figure out the name for the method, I really don't care. I will wait on the issue until you guys bikeshed it out. The only thing I care about is that it's not mixed up with other logic.
It's really important when debugging a corrupt index that you can know the file itself is corrupted (e.g. hardware) versus some bug in Lucene. It's also really important that there is at least the option (like there is in this patch) to prevent this corruption from being propagated in merge. So let's please not mix unrelated stuff in here.
Things like refactoring CheckIndex or whatever: please just open a separate issue for that.
bq. You guys figure out the name for the method, I really don't care. I will wait on the issue until you guys bikeshed it out.
Uwe, so what do you think: validateChecksums or checkIntegrity?
Let's get this thing committed so Jenkins can bless it.
checkIntegrity is fine with me. We can also rename it before releasing.
I'll mark it internal for now...
Commit 1583550 from Robert Muir in branch 'dev/trunk'
[ https://svn.apache.org/r1583550 ]
LUCENE-2446: add checksums to index files
Thanks Robert!
checkIntegrity() is in my opinion the best we can have: most generic, but not too easy to misunderstand and run after every commit. I was just afraid about the optimize()-loving people!
Uwe
One idea to make the whole thing more hidden from the user:
I am not sure we need the abstract checkIntegrity() on the public AtomicReader abstract interface, because it only applies to disk-based indexes. Would it not be enough to have it on SegmentReader? I know we might need some instanceof checks while merging, but I think we do those already. That way it would not be public and we could hide it from the user. We would only not validate when merging FilterAtomicReaders (split / sorted indexes).
Any opinion about this?
I don't think it's true that it "only applies to disk-based indexes".
FilterReader/SlowWrapper etc. pass this down to their underlying readers, so you still get the check on the underlying data, e.g. if you are using SortingMergePolicy.
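As a rough illustration (assuming the 4.x FilterAtomicReader API and the checkIntegrity() name settled on above), a filter reader holds no files itself but can pass the check down to the on-disk reader it wraps; the subclass here is hypothetical:
{code:java}
import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.FilterAtomicReader;

// Hypothetical filter reader: it owns no index files, but the integrity
// check still reaches the SegmentReader (or other reader) it wraps.
class PassThroughReader extends FilterAtomicReader {
    PassThroughReader(AtomicReader in) {
        super(in);
    }

    @Override
    public void checkIntegrity() throws IOException {
        in.checkIntegrity(); // delegate: verifies checksums of the underlying data
    }
}
{code}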
Commit 1583565 from Robert Muir in branch 'dev/trunk'
[ https://svn.apache.org/r1583565 ]
LUCENE-2446: ensure we close file if we hit exception writing codec header
Commit 1583863 from Robert Muir in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1583863 ]
LUCENE-2446: add checksums to index files
Hi Robert,
I think we should add a Lucene 4.8 TestBackwardsCompatibility index to the test resources. Should I prepare one?
We should take care of it, though; maybe open an issue so we add it once 4.8 is released?
It's part of the release workflow. Come on, this is nothing new and no different from any other format change. No need to single out this issue.
I think this is a pretty important issue: besides the case of distributed systems copying files around, we have the issue that today there is no integrity mechanism to detect hardware problems (which can cause developers to pull their hair out trying to debug corruptions), and we have some optimized components doing bulk merge which can propagate corruptions to new segments over a long time.
Also, in recent JVMs computing the checksum is fast: e.g. in Java 8 CRC32 is intrinsic and uses clmul hardware instructions on x86, and so on.
I created an initial patch: the last 8 bytes of every file are a zlib-crc32 checksum. We also write some additional metadata before it (this is done via CodecUtil.writeFooter) so we can extend it further in the future if we need to.
For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify when we open, because we are reading the file anyway. So this provides some extra safety.
For larger files this would be expensive: instead the patch adds AtomicReader.validate(), which asks the codec (or FilterReader, or whatever) to ensure everything is valid. This is called by e.g. CheckIndex before decoding.
The patch adds an option (defaults to off) on IndexWriterConfig to call this before merging. Ideally we wouldn't need this and would just validate-as-we-merge, but that requires some codec/merge API changes...
File format changes are backwards compatible.
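To make the footer concrete, here is a rough sketch of the write/verify pattern described above, assuming the CodecUtil.writeFooter/checksumEntireFile helpers roughly as committed; the file name, codec name, and contents are made up for illustration:
{code:java}
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

class FooterSketch {
    // Writing: CodecUtil.writeFooter appends the footer magic, an algorithm
    // id, and the zlib-crc32 of everything written before it.
    static void write(Directory dir) throws IOException {
        IndexOutput out = dir.createOutput("example.dat", IOContext.DEFAULT);
        try {
            CodecUtil.writeHeader(out, "ExampleCodec", 0);
            out.writeLong(42L); // ...actual file contents...
            CodecUtil.writeFooter(out);
        } finally {
            out.close();
        }
    }

    // Verifying: a large file can be checksummed in one sequential pass
    // without decoding it; a mismatch raises CorruptIndexException.
    static void verify(Directory dir) throws IOException {
        IndexInput in = dir.openInput("example.dat", IOContext.READONCE);
        try {
            CodecUtil.checksumEntireFile(in);
        } finally {
            in.close();
        }
    }
}
{code}
The merge-time option then only needs a single IndexWriterConfig switch that triggers this per-file verification before a merge reads the segment.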