Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 5.0
    • Component/s: core/index
    • Labels:
    • Lucene Fields:
      New

      Description

      It would be useful for the different files in a Lucene index to include checksums. This would make it easy to spot corruption while copying index files around; the various cloud efforts assume many more data-copying operations than older single-index implementations.

      This feature might be much easier to implement if all index files are created in a sequential fashion. This issue therefore depends on LUCENE-2373.

      1. LUCENE-2446.patch
        213 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          I think this is a pretty important issue: besides the case of distributed system copying files around, we have the issue that today there is no integrity mechanism to detect hardware issues (can cause developers to pull hair out trying to debug corruptions), and we have some optimized components doing bulk merge which can propagate corruptions to new segments over a long time.

          Also in recent jvms, computing checksum is fast: e.g. in java8 CRC32 is intrinsic and uses clmul hardware instructions on x86 and so on.

          I created an initial patch: the last 8 bytes of every file is a zlib-crc32 checksum. We also write some additional metadata before it (its done via CodecUtil.writeFooter) so we can extend it more in the future if we need.

          For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify when we open, because we are reading the file anyway. So this provides some extra safety.

          For larger files this would be expensive: instead the patch adds AtomicReader.validate() which asks the codec (or filterreader, or whatever), to ensure everything is valid. This is called by e.g. checkindex before decoding.

          Patch adds an option (defaults to off) on IndexWriterConfig to call this before merging. Ideally we wouldnt need this and just validate-as-we-merge, but that requires some codec/merge API changes...

          File format changes are backwards compatible.

          Show
          Robert Muir added a comment - I think this is a pretty important issue: besides the case of distributed system copying files around, we have the issue that today there is no integrity mechanism to detect hardware issues (can cause developers to pull hair out trying to debug corruptions), and we have some optimized components doing bulk merge which can propagate corruptions to new segments over a long time. Also in recent jvms, computing checksum is fast: e.g. in java8 CRC32 is intrinsic and uses clmul hardware instructions on x86 and so on. I created an initial patch: the last 8 bytes of every file is a zlib-crc32 checksum. We also write some additional metadata before it (its done via CodecUtil.writeFooter) so we can extend it more in the future if we need. For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify when we open, because we are reading the file anyway. So this provides some extra safety. For larger files this would be expensive: instead the patch adds AtomicReader.validate() which asks the codec (or filterreader, or whatever), to ensure everything is valid. This is called by e.g. checkindex before decoding. Patch adds an option (defaults to off) on IndexWriterConfig to call this before merging. Ideally we wouldnt need this and just validate-as-we-merge, but that requires some codec/merge API changes... File format changes are backwards compatible.
          Hide
          Shai Erera added a comment -

          This looks really great! +1

          Show
          Shai Erera added a comment - This looks really great! +1
          Hide
          Michael McCandless added a comment -

          +1, this looks wonderful; it gives us end-to-end checksums, like ZFS.

          This means, when there is a checksum mis-match, we can be quite certain that there's a hardware problem (bit flipper) in the user's env, and not a Lucene bug.

          Show
          Michael McCandless added a comment - +1, this looks wonderful; it gives us end-to-end checksums, like ZFS. This means, when there is a checksum mis-match, we can be quite certain that there's a hardware problem (bit flipper) in the user's env, and not a Lucene bug.
          Hide
          Simon Willnauer added a comment -

          this is awesome - robert do you think it would make sense to commit this to a branch and let CI run it for a bit. That way we can clean stuff up further if need as well? I just wonder though since it's a pretty big patch

          Show
          Simon Willnauer added a comment - this is awesome - robert do you think it would make sense to commit this to a branch and let CI run it for a bit. That way we can clean stuff up further if need as well? I just wonder though since it's a pretty big patch
          Hide
          Michael McCandless added a comment -

          I think the patch is quite solid as is ... we should just commit & let it bake for a while on trunk? We can iterate on further improvements...

          I just ran distributed beasting from luceneutil (runs all Lucene core + modules tests, across 5 machines, 28 paralle, JVMs) ~150 times over and it didn't hit any test failures.

          Show
          Michael McCandless added a comment - I think the patch is quite solid as is ... we should just commit & let it bake for a while on trunk? We can iterate on further improvements... I just ran distributed beasting from luceneutil (runs all Lucene core + modules tests, across 5 machines, 28 paralle, JVMs) ~150 times over and it didn't hit any test failures.
          Hide
          Bert Sanders added a comment - - edited

          Is CRC32 intrinsic available and the same on all relevant platforms ?
          (I assume it is CRC32C version of Intel)

          If not, what happens ? Does it still work ? Does it need to be cross-platform-compatible ?
          Could it be a concern for performance-related issue ?

          Show
          Bert Sanders added a comment - - edited Is CRC32 intrinsic available and the same on all relevant platforms ? (I assume it is CRC32C version of Intel) If not, what happens ? Does it still work ? Does it need to be cross-platform-compatible ? Could it be a concern for performance-related issue ?
          Hide
          Robert Muir added a comment -

          java.util.zip.CRC32 has been used by lucene for the segments_N for quite some time.
          This just applies it to other files. And while is nice that it is 3x faster in java 8, it is still 1GB/second with java 7 on my computer.
          By the way, the instruction used in java8 is not CRC32C. it is PCLMULQDQ. If they broke this, it would be a big bug in java

          Show
          Robert Muir added a comment - java.util.zip.CRC32 has been used by lucene for the segments_N for quite some time. This just applies it to other files. And while is nice that it is 3x faster in java 8, it is still 1GB/second with java 7 on my computer. By the way, the instruction used in java8 is not CRC32C. it is PCLMULQDQ. If they broke this, it would be a big bug in java
          Hide
          Simon Willnauer added a comment -

          I glanced at the patch again I think it looks pretty good. I didn't look close enough before! +1 to move that in and let it bake

          Show
          Simon Willnauer added a comment - I glanced at the patch again I think it looks pretty good. I didn't look close enough before! +1 to move that in and let it bake
          Hide
          Uwe Schindler added a comment -

          Great!

          I am not so happy about the method name validate(). Especially as it is in AtomicReader, people might think (because of name), its just some cheap check. Yes, I know there are Javadocs, but IDEs suggestions may lead to use it!

          I have no yet thought thoroghly about a better name, maybe checkIntegrity()?

          Show
          Uwe Schindler added a comment - Great! I am not so happy about the method name validate() . Especially as it is in AtomicReader, people might think (because of name), its just some cheap check. Yes, I know there are Javadocs, but IDEs suggestions may lead to use it! I have no yet thought thoroghly about a better name, maybe checkIntegrity() ?
          Hide
          Uwe Schindler added a comment -

          Is CRC32 intrinsic available and the same on all relevant platforms ?

          Does not matter what the JVM does internally. The public API for CRC32 is there in the java.util.zip package and we use it since the early beginning of Lucene. If it would be suddenly return incorrect/different results, it would violate ZIP spec (which is indirectly an ISO standard - through OpenDocument ISO standard).

          Show
          Uwe Schindler added a comment - Is CRC32 intrinsic available and the same on all relevant platforms ? Does not matter what the JVM does internally. The public API for CRC32 is there in the java.util.zip package and we use it since the early beginning of Lucene. If it would be suddenly return incorrect/different results, it would violate ZIP spec (which is indirectly an ISO standard - through OpenDocument ISO standard).
          Hide
          Erick Erickson added a comment -

          bq: I am not so happy about the method name

          Why not a method name that indicates the function used? Something like checkIntegrityUsingRobertsNiftyNewCRC32Code?

          Hmmm, a little long, but what about something like
          validateCRC32
          or
          checkIntegrityCRC32

          Show
          Erick Erickson added a comment - bq: I am not so happy about the method name Why not a method name that indicates the function used? Something like checkIntegrityUsingRobertsNiftyNewCRC32Code? Hmmm, a little long, but what about something like validateCRC32 or checkIntegrityCRC32
          Hide
          Uwe Schindler added a comment - - edited

          Although its currently only doing file integrity checks, we may extend this later! Maybe remove more parts of CheckIndex class and move it to Codec level?

          Another idea: Move validate()/checkIntegrity() up to IndexReader, so it would also work on DirectoryReaders? This would help with cleaning up stuff from CheckIndex.

          Show
          Uwe Schindler added a comment - - edited Although its currently only doing file integrity checks, we may extend this later! Maybe remove more parts of CheckIndex class and move it to Codec level? Another idea: Move validate()/checkIntegrity() up to IndexReader, so it would also work on DirectoryReaders? This would help with cleaning up stuff from CheckIndex.
          Hide
          Robert Muir added a comment -

          I don't want to do this Uwe. I dont want all this checkindex stuff being run at merge. A big point here is to provide an option to prevent detection on merge.

          If you read my description I am unhappy about the current situation because it cannot be enabled by default for performance reasons. I want to fix the codec apis so this is no longer the case. I don't want checkindex code moved into this method.

          Show
          Robert Muir added a comment - I don't want to do this Uwe. I dont want all this checkindex stuff being run at merge. A big point here is to provide an option to prevent detection on merge. If you read my description I am unhappy about the current situation because it cannot be enabled by default for performance reasons. I want to fix the codec apis so this is no longer the case. I don't want checkindex code moved into this method.
          Hide
          Uwe Schindler added a comment -

          Robert, sorry if I throwed some unrelated ideas in here: My idea here has not much to do with the stuffs here. It just came into my mind when I was seeing the validate() method which made me afraid.

          Now just throwing in something, which does not have to affect you while implementing checksumming - just think about it as a separate issue, inspired by this one:

          Uwe's unrelated idea

          In general CheckIndex as a separate class is wrong in my opinion. Especially if it does checks not valid for all codecs (using instanceof checks in CheckIndex code - horrible). A really good CheckIndex should work like fsck - implemented by the file system driver + filesystem code. So the checks currently done by CheckIndex should be done by the codecs (only general checks may work on top - like numDocs checks, etc.)

          Now back to this issue: To come back to the initial suggestion: validate => checkIntegrity. What do you think?

          Show
          Uwe Schindler added a comment - Robert, sorry if I throwed some unrelated ideas in here: My idea here has not much to do with the stuffs here. It just came into my mind when I was seeing the validate() method which made me afraid. Now just throwing in something, which does not have to affect you while implementing checksumming - just think about it as a separate issue, inspired by this one: Uwe's unrelated idea In general CheckIndex as a separate class is wrong in my opinion. Especially if it does checks not valid for all codecs (using instanceof checks in CheckIndex code - horrible). A really good CheckIndex should work like fsck - implemented by the file system driver + filesystem code. So the checks currently done by CheckIndex should be done by the codecs (only general checks may work on top - like numDocs checks, etc.) Now back to this issue: To come back to the initial suggestion: validate => checkIntegrity. What do you think?
          Hide
          Shai Erera added a comment -

          I think checkIntegrity is too generic (i.e. it does not check the integrity of the posting lists, a'la what CheckIndex does. How about validateChecksums, since that's what it currently does? I don't mind if it's renamed in the future if the method validates more things, we shouldn't worry about back-compat with this API.

          Show
          Shai Erera added a comment - I think checkIntegrity is too generic (i.e. it does not check the integrity of the posting lists, a'la what CheckIndex does. How about validateChecksums , since that's what it currently does? I don't mind if it's renamed in the future if the method validates more things, we shouldn't worry about back-compat with this API.
          Hide
          Robert Muir added a comment -

          You guys figure out the name for the method, i really dont care. I will wait on the issue until you guys bikeshed it out. The only thing i care about is that its not mixed up with other logic.

          Its really important when debugging a corrupt index that you can know that the file itself is corrupted (e.g. hardware) versus some bug in lucene. Its also really important that there is at least the option (like there is in this patch) to prevent this corruption from being propagated in merge. So lets please not mix unrelated stuff in here.

          Things like refactoring checkindex or whatever: please just open a separate issue for that

          Show
          Robert Muir added a comment - You guys figure out the name for the method, i really dont care. I will wait on the issue until you guys bikeshed it out. The only thing i care about is that its not mixed up with other logic. Its really important when debugging a corrupt index that you can know that the file itself is corrupted (e.g. hardware) versus some bug in lucene. Its also really important that there is at least the option (like there is in this patch) to prevent this corruption from being propagated in merge. So lets please not mix unrelated stuff in here. Things like refactoring checkindex or whatever: please just open a separate issue for that
          Hide
          Shai Erera added a comment -

          You guys figure out the name for the method, i really dont care. I will wait on the issue until you guys bikeshed it out.

          Uwe, so what do you think: validateChecksums or checkIntegrity?
          Let's get this thing committed so Jenkins can bless it.

          Show
          Shai Erera added a comment - You guys figure out the name for the method, i really dont care. I will wait on the issue until you guys bikeshed it out. Uwe, so what do you think: validateChecksums or checkIntegrity ? Let's get this thing committed so Jenkins can bless it.
          Hide
          Robert Muir added a comment -

          checkIntegrity is fine with me. We can also rename it before releasing.

          I'll mark it internal for now...

          Show
          Robert Muir added a comment - checkIntegrity is fine with me. We can also rename it before releasing. I'll mark it internal for now...
          Hide
          Shai Erera added a comment -

          +1

          Show
          Shai Erera added a comment - +1
          Hide
          ASF subversion and git services added a comment -

          Commit 1583550 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1583550 ]

          LUCENE-2446: add checksums to index files

          Show
          ASF subversion and git services added a comment - Commit 1583550 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1583550 ] LUCENE-2446 : add checksums to index files
          Hide
          Uwe Schindler added a comment -

          Thanks Robert!
          checkIntegrity() is in my opinion the best we can have: most generic, but not to easy to misunderstand and run after every commitI was just afraid about the optimize()-loving people!
          Uwe

          Show
          Uwe Schindler added a comment - Thanks Robert! checkIntegrity() is in my opinion the best we can have: most generic, but not to easy to misunderstand and run after every commitI was just afraid about the optimize() -loving people! Uwe
          Hide
          Uwe Schindler added a comment -

          One idea to make the whole thing more hidden to the user:
          I am not sure if we need the abstract checkIntegrity() on the public AtomicReader abstract interface - because it only applies to disk-based indexes. Would it be not enough to have it on SegmentReader? I know we might need some instanceof checks while merging, but I think we do those already. By that it would not be public and we can hide it to the user. We would only not validate on FilterAtomicReader's merging (split / sort indexes).

          Any opinion about this?

          Show
          Uwe Schindler added a comment - One idea to make the whole thing more hidden to the user: I am not sure if we need the abstract checkIntegrity() on the public AtomicReader abstract interface - because it only applies to disk-based indexes. Would it be not enough to have it on SegmentReader? I know we might need some instanceof checks while merging, but I think we do those already. By that it would not be public and we can hide it to the user. We would only not validate on FilterAtomicReader's merging (split / sort indexes). Any opinion about this?
          Hide
          Robert Muir added a comment -

          I don't think that its true it "only applies to disk-based indexes".

          FilterReader/SlowWrapper etc pass this down to their underlying readers so you still get the check on the underlying data e.g. if you are using SortingMergePolicy.

          Show
          Robert Muir added a comment - I don't think that its true it "only applies to disk-based indexes". FilterReader/SlowWrapper etc pass this down to their underlying readers so you still get the check on the underlying data e.g. if you are using SortingMergePolicy.
          Hide
          ASF subversion and git services added a comment -

          Commit 1583565 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1583565 ]

          LUCENE-2446: ensure we close file if we hit exception writing codec header

          Show
          ASF subversion and git services added a comment - Commit 1583565 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1583565 ] LUCENE-2446 : ensure we close file if we hit exception writing codec header
          Hide
          ASF subversion and git services added a comment -

          Commit 1583863 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1583863 ]

          LUCENE-2446: add checksums to index files

          Show
          ASF subversion and git services added a comment - Commit 1583863 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1583863 ] LUCENE-2446 : add checksums to index files
          Hide
          Robert Muir added a comment -

          I will open followups for some of the ideas here.

          Show
          Robert Muir added a comment - I will open followups for some of the ideas here.
          Hide
          Uwe Schindler added a comment -

          Hi Robert,
          I think we should add a Lucene 4.8 TestBackwardsCompatibility index to the test resources. Should I prepare one?

          Show
          Uwe Schindler added a comment - Hi Robert, I think we should add a Lucene 4.8 TestBackwardsCompatibility index to the test resources. Should I prepare one?
          Hide
          Robert Muir added a comment -

          Absolutely not! 4.8 is not released

          Show
          Robert Muir added a comment - Absolutely not! 4.8 is not released
          Hide
          Uwe Schindler added a comment -

          But we should take care of it, maybe open issue! so we add it once 4.8 is released?

          Show
          Uwe Schindler added a comment - But we should take care of it, maybe open issue! so we add it once 4.8 is released?
          Hide
          Robert Muir added a comment -

          It's part of the release workflow. Come on, this is nothing new and no different from any other format change. No need to single out this issue.

          Show
          Robert Muir added a comment - It's part of the release workflow. Come on, this is nothing new and no different from any other format change. No need to single out this issue.

            People

            • Assignee:
              Unassigned
              Reporter:
              Lance Norskog
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development