Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 4.8, 6.0
    • core/index
    • New

    Description

      It would be useful for the different files in a Lucene index to include checksums. This would make it easy to spot corruption while copying index files around; the various cloud efforts assume many more data-copying operations than older single-index implementations.

      This feature might be much easier to implement if all index files are created in a sequential fashion. This issue therefore depends on LUCENE-2373.

      Attachments

        1. LUCENE-2446.patch
          213 kB
          Robert Muir

        Issue Links

          Activity

            rcmuir Robert Muir added a comment -

            I think this is a pretty important issue: besides the case of distributed system copying files around, we have the issue that today there is no integrity mechanism to detect hardware issues (can cause developers to pull hair out trying to debug corruptions), and we have some optimized components doing bulk merge which can propagate corruptions to new segments over a long time.

            Also in recent jvms, computing checksum is fast: e.g. in java8 CRC32 is intrinsic and uses clmul hardware instructions on x86 and so on.

            I created an initial patch: the last 8 bytes of every file is a zlib-crc32 checksum. We also write some additional metadata before it (its done via CodecUtil.writeFooter) so we can extend it more in the future if we need.

            For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify when we open, because we are reading the file anyway. So this provides some extra safety.

            For larger files this would be expensive: instead the patch adds AtomicReader.validate() which asks the codec (or filterreader, or whatever), to ensure everything is valid. This is called by e.g. checkindex before decoding.

            Patch adds an option (defaults to off) on IndexWriterConfig to call this before merging. Ideally we wouldnt need this and just validate-as-we-merge, but that requires some codec/merge API changes...

            File format changes are backwards compatible.

            rcmuir Robert Muir added a comment - I think this is a pretty important issue: besides the case of distributed system copying files around, we have the issue that today there is no integrity mechanism to detect hardware issues (can cause developers to pull hair out trying to debug corruptions), and we have some optimized components doing bulk merge which can propagate corruptions to new segments over a long time. Also in recent jvms, computing checksum is fast: e.g. in java8 CRC32 is intrinsic and uses clmul hardware instructions on x86 and so on. I created an initial patch: the last 8 bytes of every file is a zlib-crc32 checksum. We also write some additional metadata before it (its done via CodecUtil.writeFooter) so we can extend it more in the future if we need. For small metadata files (e.g. .fnm, .si, .dvm, ...) we just verify when we open, because we are reading the file anyway. So this provides some extra safety. For larger files this would be expensive: instead the patch adds AtomicReader.validate() which asks the codec (or filterreader, or whatever), to ensure everything is valid. This is called by e.g. checkindex before decoding. Patch adds an option (defaults to off) on IndexWriterConfig to call this before merging. Ideally we wouldnt need this and just validate-as-we-merge, but that requires some codec/merge API changes... File format changes are backwards compatible.
            shaie Shai Erera added a comment -

            This looks really great! +1

            shaie Shai Erera added a comment - This looks really great! +1

            +1, this looks wonderful; it gives us end-to-end checksums, like ZFS.

            This means, when there is a checksum mis-match, we can be quite certain that there's a hardware problem (bit flipper) in the user's env, and not a Lucene bug.

            mikemccand Michael McCandless added a comment - +1, this looks wonderful; it gives us end-to-end checksums, like ZFS. This means, when there is a checksum mis-match, we can be quite certain that there's a hardware problem (bit flipper) in the user's env, and not a Lucene bug.

            this is awesome - robert do you think it would make sense to commit this to a branch and let CI run it for a bit. That way we can clean stuff up further if need as well? I just wonder though since it's a pretty big patch

            simonw Simon Willnauer added a comment - this is awesome - robert do you think it would make sense to commit this to a branch and let CI run it for a bit. That way we can clean stuff up further if need as well? I just wonder though since it's a pretty big patch

            I think the patch is quite solid as is ... we should just commit & let it bake for a while on trunk? We can iterate on further improvements...

            I just ran distributed beasting from luceneutil (runs all Lucene core + modules tests, across 5 machines, 28 paralle, JVMs) ~150 times over and it didn't hit any test failures.

            mikemccand Michael McCandless added a comment - I think the patch is quite solid as is ... we should just commit & let it bake for a while on trunk? We can iterate on further improvements... I just ran distributed beasting from luceneutil (runs all Lucene core + modules tests, across 5 machines, 28 paralle, JVMs) ~150 times over and it didn't hit any test failures.
            mrbsd Bert Sanders added a comment - - edited

            Is CRC32 intrinsic available and the same on all relevant platforms ?
            (I assume it is CRC32C version of Intel)

            If not, what happens ? Does it still work ? Does it need to be cross-platform-compatible ?
            Could it be a concern for performance-related issue ?

            mrbsd Bert Sanders added a comment - - edited Is CRC32 intrinsic available and the same on all relevant platforms ? (I assume it is CRC32C version of Intel) If not, what happens ? Does it still work ? Does it need to be cross-platform-compatible ? Could it be a concern for performance-related issue ?
            rcmuir Robert Muir added a comment -

            java.util.zip.CRC32 has been used by lucene for the segments_N for quite some time.
            This just applies it to other files. And while is nice that it is 3x faster in java 8, it is still 1GB/second with java 7 on my computer.
            By the way, the instruction used in java8 is not CRC32C. it is PCLMULQDQ. If they broke this, it would be a big bug in java

            rcmuir Robert Muir added a comment - java.util.zip.CRC32 has been used by lucene for the segments_N for quite some time. This just applies it to other files. And while is nice that it is 3x faster in java 8, it is still 1GB/second with java 7 on my computer. By the way, the instruction used in java8 is not CRC32C. it is PCLMULQDQ. If they broke this, it would be a big bug in java

            I glanced at the patch again I think it looks pretty good. I didn't look close enough before! +1 to move that in and let it bake

            simonw Simon Willnauer added a comment - I glanced at the patch again I think it looks pretty good. I didn't look close enough before! +1 to move that in and let it bake
            uschindler Uwe Schindler added a comment -

            Great!

            I am not so happy about the method name validate(). Especially as it is in AtomicReader, people might think (because of name), its just some cheap check. Yes, I know there are Javadocs, but IDEs suggestions may lead to use it!

            I have no yet thought thoroghly about a better name, maybe checkIntegrity()?

            uschindler Uwe Schindler added a comment - Great! I am not so happy about the method name validate() . Especially as it is in AtomicReader, people might think (because of name), its just some cheap check. Yes, I know there are Javadocs, but IDEs suggestions may lead to use it! I have no yet thought thoroghly about a better name, maybe checkIntegrity() ?
            uschindler Uwe Schindler added a comment -

            Is CRC32 intrinsic available and the same on all relevant platforms ?

            Does not matter what the JVM does internally. The public API for CRC32 is there in the java.util.zip package and we use it since the early beginning of Lucene. If it would be suddenly return incorrect/different results, it would violate ZIP spec (which is indirectly an ISO standard - through OpenDocument ISO standard).

            uschindler Uwe Schindler added a comment - Is CRC32 intrinsic available and the same on all relevant platforms ? Does not matter what the JVM does internally. The public API for CRC32 is there in the java.util.zip package and we use it since the early beginning of Lucene. If it would be suddenly return incorrect/different results, it would violate ZIP spec (which is indirectly an ISO standard - through OpenDocument ISO standard).
            erickerickson Erick Erickson added a comment -

            bq: I am not so happy about the method name

            Why not a method name that indicates the function used? Something like checkIntegrityUsingRobertsNiftyNewCRC32Code?

            Hmmm, a little long, but what about something like
            validateCRC32
            or
            checkIntegrityCRC32

            erickerickson Erick Erickson added a comment - bq: I am not so happy about the method name Why not a method name that indicates the function used? Something like checkIntegrityUsingRobertsNiftyNewCRC32Code? Hmmm, a little long, but what about something like validateCRC32 or checkIntegrityCRC32
            uschindler Uwe Schindler added a comment - - edited

            Although its currently only doing file integrity checks, we may extend this later! Maybe remove more parts of CheckIndex class and move it to Codec level?

            Another idea: Move validate()/checkIntegrity() up to IndexReader, so it would also work on DirectoryReaders? This would help with cleaning up stuff from CheckIndex.

            uschindler Uwe Schindler added a comment - - edited Although its currently only doing file integrity checks, we may extend this later! Maybe remove more parts of CheckIndex class and move it to Codec level? Another idea: Move validate()/checkIntegrity() up to IndexReader, so it would also work on DirectoryReaders? This would help with cleaning up stuff from CheckIndex.
            rcmuir Robert Muir added a comment -

            I don't want to do this Uwe. I dont want all this checkindex stuff being run at merge. A big point here is to provide an option to prevent detection on merge.

            If you read my description I am unhappy about the current situation because it cannot be enabled by default for performance reasons. I want to fix the codec apis so this is no longer the case. I don't want checkindex code moved into this method.

            rcmuir Robert Muir added a comment - I don't want to do this Uwe. I dont want all this checkindex stuff being run at merge. A big point here is to provide an option to prevent detection on merge. If you read my description I am unhappy about the current situation because it cannot be enabled by default for performance reasons. I want to fix the codec apis so this is no longer the case. I don't want checkindex code moved into this method.
            uschindler Uwe Schindler added a comment -

            Robert, sorry if I throwed some unrelated ideas in here: My idea here has not much to do with the stuffs here. It just came into my mind when I was seeing the validate() method which made me afraid.

            Now just throwing in something, which does not have to affect you while implementing checksumming - just think about it as a separate issue, inspired by this one:

            Uwe's unrelated idea

            In general CheckIndex as a separate class is wrong in my opinion. Especially if it does checks not valid for all codecs (using instanceof checks in CheckIndex code - horrible). A really good CheckIndex should work like fsck - implemented by the file system driver + filesystem code. So the checks currently done by CheckIndex should be done by the codecs (only general checks may work on top - like numDocs checks, etc.)

            Now back to this issue: To come back to the initial suggestion: validate => checkIntegrity. What do you think?

            uschindler Uwe Schindler added a comment - Robert, sorry if I throwed some unrelated ideas in here: My idea here has not much to do with the stuffs here. It just came into my mind when I was seeing the validate() method which made me afraid. Now just throwing in something, which does not have to affect you while implementing checksumming - just think about it as a separate issue, inspired by this one: Uwe's unrelated idea In general CheckIndex as a separate class is wrong in my opinion. Especially if it does checks not valid for all codecs (using instanceof checks in CheckIndex code - horrible). A really good CheckIndex should work like fsck - implemented by the file system driver + filesystem code. So the checks currently done by CheckIndex should be done by the codecs (only general checks may work on top - like numDocs checks, etc.) Now back to this issue: To come back to the initial suggestion: validate => checkIntegrity. What do you think?
            shaie Shai Erera added a comment -

            I think checkIntegrity is too generic (i.e. it does not check the integrity of the posting lists, a'la what CheckIndex does. How about validateChecksums, since that's what it currently does? I don't mind if it's renamed in the future if the method validates more things, we shouldn't worry about back-compat with this API.

            shaie Shai Erera added a comment - I think checkIntegrity is too generic (i.e. it does not check the integrity of the posting lists, a'la what CheckIndex does. How about validateChecksums , since that's what it currently does? I don't mind if it's renamed in the future if the method validates more things, we shouldn't worry about back-compat with this API.
            rcmuir Robert Muir added a comment -

            You guys figure out the name for the method, i really dont care. I will wait on the issue until you guys bikeshed it out. The only thing i care about is that its not mixed up with other logic.

            Its really important when debugging a corrupt index that you can know that the file itself is corrupted (e.g. hardware) versus some bug in lucene. Its also really important that there is at least the option (like there is in this patch) to prevent this corruption from being propagated in merge. So lets please not mix unrelated stuff in here.

            Things like refactoring checkindex or whatever: please just open a separate issue for that

            rcmuir Robert Muir added a comment - You guys figure out the name for the method, i really dont care. I will wait on the issue until you guys bikeshed it out. The only thing i care about is that its not mixed up with other logic. Its really important when debugging a corrupt index that you can know that the file itself is corrupted (e.g. hardware) versus some bug in lucene. Its also really important that there is at least the option (like there is in this patch) to prevent this corruption from being propagated in merge. So lets please not mix unrelated stuff in here. Things like refactoring checkindex or whatever: please just open a separate issue for that
            shaie Shai Erera added a comment -

            You guys figure out the name for the method, i really dont care. I will wait on the issue until you guys bikeshed it out.

            Uwe, so what do you think: validateChecksums or checkIntegrity?
            Let's get this thing committed so Jenkins can bless it.

            shaie Shai Erera added a comment - You guys figure out the name for the method, i really dont care. I will wait on the issue until you guys bikeshed it out. Uwe, so what do you think: validateChecksums or checkIntegrity ? Let's get this thing committed so Jenkins can bless it.
            rcmuir Robert Muir added a comment -

            checkIntegrity is fine with me. We can also rename it before releasing.

            I'll mark it internal for now...

            rcmuir Robert Muir added a comment - checkIntegrity is fine with me. We can also rename it before releasing. I'll mark it internal for now...
            shaie Shai Erera added a comment -

            +1

            shaie Shai Erera added a comment - +1

            Commit 1583550 from Robert Muir in branch 'dev/trunk'
            [ https://svn.apache.org/r1583550 ]

            LUCENE-2446: add checksums to index files

            jira-bot ASF subversion and git services added a comment - Commit 1583550 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1583550 ] LUCENE-2446 : add checksums to index files
            uschindler Uwe Schindler added a comment -

            Thanks Robert!
            checkIntegrity() is in my opinion the best we can have: most generic, but not to easy to misunderstand and run after every commitI was just afraid about the optimize()-loving people!
            Uwe

            uschindler Uwe Schindler added a comment - Thanks Robert! checkIntegrity() is in my opinion the best we can have: most generic, but not to easy to misunderstand and run after every commitI was just afraid about the optimize() -loving people! Uwe
            uschindler Uwe Schindler added a comment -

            One idea to make the whole thing more hidden to the user:
            I am not sure if we need the abstract checkIntegrity() on the public AtomicReader abstract interface - because it only applies to disk-based indexes. Would it be not enough to have it on SegmentReader? I know we might need some instanceof checks while merging, but I think we do those already. By that it would not be public and we can hide it to the user. We would only not validate on FilterAtomicReader's merging (split / sort indexes).

            Any opinion about this?

            uschindler Uwe Schindler added a comment - One idea to make the whole thing more hidden to the user: I am not sure if we need the abstract checkIntegrity() on the public AtomicReader abstract interface - because it only applies to disk-based indexes. Would it be not enough to have it on SegmentReader? I know we might need some instanceof checks while merging, but I think we do those already. By that it would not be public and we can hide it to the user. We would only not validate on FilterAtomicReader's merging (split / sort indexes). Any opinion about this?
            rcmuir Robert Muir added a comment -

            I don't think that its true it "only applies to disk-based indexes".

            FilterReader/SlowWrapper etc pass this down to their underlying readers so you still get the check on the underlying data e.g. if you are using SortingMergePolicy.

            rcmuir Robert Muir added a comment - I don't think that its true it "only applies to disk-based indexes". FilterReader/SlowWrapper etc pass this down to their underlying readers so you still get the check on the underlying data e.g. if you are using SortingMergePolicy.

            Commit 1583565 from Robert Muir in branch 'dev/trunk'
            [ https://svn.apache.org/r1583565 ]

            LUCENE-2446: ensure we close file if we hit exception writing codec header

            jira-bot ASF subversion and git services added a comment - Commit 1583565 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1583565 ] LUCENE-2446 : ensure we close file if we hit exception writing codec header

            Commit 1583863 from Robert Muir in branch 'dev/branches/branch_4x'
            [ https://svn.apache.org/r1583863 ]

            LUCENE-2446: add checksums to index files

            jira-bot ASF subversion and git services added a comment - Commit 1583863 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1583863 ] LUCENE-2446 : add checksums to index files
            rcmuir Robert Muir added a comment -

            I will open followups for some of the ideas here.

            rcmuir Robert Muir added a comment - I will open followups for some of the ideas here.
            uschindler Uwe Schindler added a comment -

            Hi Robert,
            I think we should add a Lucene 4.8 TestBackwardsCompatibility index to the test resources. Should I prepare one?

            uschindler Uwe Schindler added a comment - Hi Robert, I think we should add a Lucene 4.8 TestBackwardsCompatibility index to the test resources. Should I prepare one?
            rcmuir Robert Muir added a comment -

            Absolutely not! 4.8 is not released

            rcmuir Robert Muir added a comment - Absolutely not! 4.8 is not released
            uschindler Uwe Schindler added a comment -

            But we should take care of it, maybe open issue! so we add it once 4.8 is released?

            uschindler Uwe Schindler added a comment - But we should take care of it, maybe open issue! so we add it once 4.8 is released?
            rcmuir Robert Muir added a comment -

            It's part of the release workflow. Come on, this is nothing new and no different from any other format change. No need to single out this issue.

            rcmuir Robert Muir added a comment - It's part of the release workflow. Come on, this is nothing new and no different from any other format change. No need to single out this issue.
            uschindler Uwe Schindler added a comment -

            Close issue after release of 4.8.0

            uschindler Uwe Schindler added a comment - Close issue after release of 4.8.0
            tomoko Tomoko Uchida added a comment -

            This issue was moved to GitHub issue: #3520.

            tomoko Tomoko Uchida added a comment - This issue was moved to GitHub issue: #3520 .

            People

              Unassigned Unassigned
              lancenorskog Lance Norskog
              Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Slack