Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.1
    • Component/s: java, spec
    • Labels:
      None

      Description

      A checksum might be included with each compressed block to better detect errors. While some filesystems (e.g. HDFS) may checksum data, not all do. Data files may also accumulate errors when copied between filesystems. For back-compatibility, we cannot easily add checksums to existing data files, but a new codec provides us with the opportunity to do this.
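
As a rough illustration of the idea, the sketch below appends a 4-byte big-endian CRC32 to a block and checks it on read. The class and method names (`ChecksummedBlock`, `frame`, `verify`) are hypothetical and not taken from the patch. Computing the checksum over the uncompressed data is an assumption made here for illustration, and a plain byte array stands in for the codec's compressed output.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksummedBlock {
    /** CRC32 of the given bytes, narrowed to the 32 bits that carry the value. */
    static int crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return (int) crc.getValue();
    }

    /**
     * Returns payload followed by a 4-byte checksum of the uncompressed data.
     * ByteBuffer writes ints big-endian by default.
     * Assumption: the checksum covers the uncompressed bytes.
     */
    static byte[] frame(byte[] payload, byte[] uncompressed) {
        ByteBuffer buf = ByteBuffer.allocate(payload.length + 4);
        buf.put(payload);
        buf.putInt(crc32(uncompressed));
        return buf.array();
    }

    /** Reads the trailing 4 bytes and compares them with a freshly computed checksum. */
    static boolean verify(byte[] framed, byte[] uncompressed) {
        if (framed.length < 4) {
            return false;
        }
        int stored = ByteBuffer.wrap(framed, framed.length - 4, 4).getInt();
        return stored == crc32(uncompressed);
    }

    public static void main(String[] args) {
        byte[] data = "example block".getBytes(StandardCharsets.UTF_8);
        // Stand-in for a compressed payload; a real codec would compress here.
        byte[] payload = data.clone();
        byte[] framed = frame(payload, data);
        System.out.println(verify(framed, data));

        // Flip one bit of the decoded data to simulate corruption.
        byte[] corrupted = data.clone();
        corrupted[0] ^= 0x01;
        System.out.println(verify(framed, corrupted));
    }
}
```

A reader would decompress the payload first and pass the result to `verify`, so corruption in either the compressed bytes or the stored checksum shows up as a mismatch.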

      1. AVRO-798.patch
        3 kB
        Doug Cutting
      2. AVRO-798.patch
        2 kB
        Doug Cutting

          Activity

          cutting Doug Cutting added a comment -

          Here's a patch that implements this.

          cutting Doug Cutting added a comment -

          Patch that also updates spec.

          scott_carey Scott Carey added a comment -

          Should we use PureJavaCrc32? It's a lot faster, though less so for larger arrays.

          http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/PureJavaCrc32.java?revision=953881&view=markup

          We can just copy that code into a class in o.a.a.util.

          cutting Doug Cutting added a comment -

          At this point I'm more concerned about getting the format right than optimizing performance. If folks agree that CRC32 is a good checksum algorithm, that this computes & stores it correctly, and that storing it at the end of snappy blocks is appropriate, then we might commit this as-is now, so long as it's not horribly slow, and fine-tune performance later. Make sense?

          Is it correct to cast the long returned from CRC32 to an int? That's the biggest concern I currently have about this patch...
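
On the cast question, a small sketch may help. `java.util.zip.CRC32.getValue()` returns a `long` whose value always fits in the low 32 bits, so narrowing to `int` keeps every bit of the checksum. The result may print as a negative number, and masking with `0xFFFFFFFFL` recovers the unsigned value. The class and helper names below (`Crc32Cast`, `crcOf`, `toInt`, `toUnsignedLong`) are hypothetical and only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Cast {
    /** CRC32 of the given bytes as returned by the JDK: a long in the range 0..2^32-1. */
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Narrowing cast: keeps the low 32 bits, which is the whole checksum. */
    static int toInt(long crc) {
        return (int) crc;
    }

    /** Reinterprets the stored int as the original unsigned 32-bit value. */
    static long toUnsignedLong(int stored) {
        return stored & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // "123456789" is the standard CRC32 check input.
        long full = crcOf("123456789".getBytes(StandardCharsets.US_ASCII));
        int stored = toInt(full);

        System.out.println(Long.toHexString(full));        // prints "cbf43926"
        System.out.println(stored < 0);                    // prints "true": high bit is set
        System.out.println(toUnsignedLong(stored) == full); // prints "true"
    }
}
```

Comparing two checksums as `int` values is therefore safe. The sign only matters when the value is widened back to `long` or printed.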

          cutting Doug Cutting added a comment -

          Any objections to committing this?

          cutting Doug Cutting added a comment -

          I committed this.


            People

            • Assignee:
              cutting Doug Cutting
              Reporter:
              cutting Doug Cutting