Cassandra / CASSANDRA-4165

Generate Digest file for compressed SSTables

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 2.1 beta2
    • Component/s: Core
    • Labels:

      Description

      We use the generated *-Digest.sha1 files to verify backups; it would be nice if they were generated for compressed sstables as well.
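The backup-verification workflow the reporter describes can be sketched as follows. This is an illustrative helper, not Spotify's actual tool; it assumes the Digest file contains the hex SHA-1 of the corresponding Data component, and the function names are hypothetical.

```python
import hashlib


def compute_sha1(path, chunk_size=1 << 16):
    """Stream a file through SHA-1 and return the hex digest."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_digest(data_path, digest_path):
    """Compare the stored digest against a fresh hash of the data file."""
    with open(digest_path) as f:
        expected = f.read().split()[0].strip()
    return compute_sha1(data_path) == expected
```

Without a Digest component for compressed sstables, a tool like this needs a special case for compressed column families, which is the gap this ticket asks to close.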

        Activity

        Jonathan Ellis added a comment -

        The thinking was that compressed sstables have a per-block checksum, so there's no need for the less granular SHA.

        Marcus Eriksson added a comment -

        Yes, but when building external tools (like our backup validation tool), it would be nice not to have special cases for compressed cfs.

        Marcus Eriksson added a comment - edited

        If you have already discussed this, just close this issue; I'll maintain the patch internally.

        Sylvain Lebresne added a comment -

        We haven't really discussed it beyond the reasoning Jonathan explained. But if it's for external tools, is it still useful to have it computed during the sstable write? (I.e., you could generate the sha1 yourself before backing up the file in the first place.) Not that it's much work for us to do (well, except maybe for the added CPU usage during the sstable write).

        Jonathan Ellis added a comment -

        We've had CASSANDRA-5791 come up as well, so I think it's reasonable to say that this is useful for more than just Spotify.

        Attached is Marcus's patch rebased to 2.0, but we may need to dig a little deeper; I'm not quite sure what the right treatment of the CRC component is, but it appears to duplicate the inline checksum that we compute for the compressed writer, which seems odd. Git places the blame on CASSANDRA-3648.

        Any light to shed, Vijay, Jason Brown?

        Vijay added a comment -

        Hi Jonathan, 3648 actually adds a block-level CRC for uncompressed files, writes it to a separate file (CRC.db), and uses it during streaming to validate parts of the file before they are streamed (not during normal reads). Hence we need two checksums during the flush: one for the blocks, and the md5 for the whole file.

        Jonathan Ellis added a comment -

        So we have

        1. Block-level CRC, in a CRC component, for uncompressed files
        2. File-level hash, in a Digest component, for uncompressed files
        3. Block-level CRC, inline in the Data component, for compressed files
        4. [proposed] file-level hash, in a Digest component, for compressed files

        So at the least we should clean up the patch here so it doesn't also generate a CRC component for compressed files. Not sure what else we can do to clean it up ... I'm not sure I'm a fan of the CRC component to begin with, but removing it (in favor of a CRCOnlyCompression, for instance) is out of scope here.
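The two complementary checksum granularities in the list above can be illustrated in one write pass: a checksum per fixed-size block (for localizing corruption) plus one digest over the whole file (for backup verification). This is a schematic sketch, not Cassandra's writer code; the function name and the tiny block size are illustrative.

```python
import hashlib
import zlib


def checksum_stream(data, block=65536):
    """Single pass over the data producing per-block CRC-32 values
    (analogous to the CRC or inline block checksums) and a whole-file
    SHA-1 hex digest (analogous to the Digest component)."""
    digest = hashlib.sha1()
    block_crcs = []
    for i in range(0, len(data), block):
        chunk = data[i:i + block]
        block_crcs.append(zlib.crc32(chunk) & 0xffffffff)
        digest.update(chunk)
    return block_crcs, digest.hexdigest()
```

Since both can be computed in the same pass, adding the file-level digest for compressed sstables costs an extra hash update per block rather than a second read of the data.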

        Jonathan Ellis added a comment -

        It turns out that uncompressed writes spend about 20% of their time computing the SHA.

        I think we should switch to Adler here, as well as at the block level. (I note that adler-capable command-line tools exist for RHEL and Debian, as well as for the scripting languages Perl, Python, and Ruby.)

        (Still need to avoid computing redundant CRC component for compressed files too.)

        /cc Piotr Kołaczkowski Benedict
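As an illustration of the proposed switch: Adler-32 is available in Python's standard library (one of the scripting languages mentioned above), so external tools could still recompute and compare the digest cheaply. A minimal sketch, with a hypothetical function name:

```python
import zlib


def adler32_file(path, chunk_size=1 << 16):
    """Incrementally compute the Adler-32 checksum of a file, a much
    cheaper whole-file digest than SHA-1 (at the cost of weaker
    corruption-detection guarantees)."""
    value = 1  # Adler-32's initial value is 1, unlike CRC-32's 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return value & 0xffffffff
```

The incremental form matters here: the digest can be folded into the existing write path without buffering the whole file.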

        Radovan Zvoncek added a comment -

        Any update on this? I have written a patch for CASSANDRA-5791 but it assumes having digests for compressed files available.

        Jonathan Ellis added a comment -

        No update. Feel free to pick it up.

        Radovan Zvoncek added a comment -

        So I used Marcus's patch as a basis and added checks preventing computation/writing of CRC components for compressed files.

        I'm still a bit worried about backwards compatibility, though. I did some experiments and didn't encounter any problems; however, I can't fully assess the impact of this change and might have missed something.

        Jonathan Ellis added a comment -

        https://github.com/jbellis/cassandra/commits/4165-3 refactors and switches to an adler32 digest, WDYT?

        Mikhail Stepura added a comment -

        Marking as "Patch available"

        Jonathan Ellis added a comment -

        Can you review that branch, Marcus Eriksson?

        Marcus Eriksson added a comment -

        Rebased, with a fix, here: https://github.com/krummas/cassandra/commits/jbellis/4165-3 (the Digest component for compressed files was not being written).

        Jonathan Ellis added a comment -

        Committed.


          People

          • Assignee: Jonathan Ellis
          • Reporter: Marcus Eriksson
          • Reviewer: Marcus Eriksson
          • Votes: 0
          • Watchers: 6
