Hadoop Common
  1. Hadoop Common
  2. HADOOP-7444

Add Checksum API to verify and calculate checksums "in bulk"

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently, the various checksum types only provide the capability to calculate the checksum of a range of a byte array. For HDFS-2080, it's advantageous to provide an API that, given a buffer with some number of "checksum chunks", can either calculate or verify the checksums of all of the chunks. For example, given a 4KB buffer and a 512-byte chunk size, it would calculate or verify 8 CRC32s in one call.

      This allows efficient JNI-based checksum implementations since the cost of crossing the JNI boundary is amortized across many computations.

      1. hadoop-7444.txt
        7 kB
        Todd Lipcon
      2. hadoop-7444.txt
        12 kB
        Todd Lipcon
      3. hadoop-7444.txt
        12 kB
        Todd Lipcon

        Issue Links

          Activity

          Hide
          Todd Lipcon added a comment -

          Fairly simple patch implementing this API.

          Unfortunately it wasn't straightforward to use the new APIs from FSInputChecker, since that class only has a Checksum instance and not a DataChecksum instance. We can tackle that as an improvement later.

          Show
          Todd Lipcon added a comment - Fairly simple patch implementing this API. Unfortunately it wasn't straightforward to use the new APIs from FSInputChecker, since that class only has a Checksum instance and not a DataChecksum instance. We can tackle that as an improvement later.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12485374/hadoop-7444.txt
          against trunk revision 1143219.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/698//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/698//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/698//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12485374/hadoop-7444.txt against trunk revision 1143219. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/698//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/698//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/698//console This message is automatically generated.
          Hide
          Todd Lipcon added a comment -

          Improved patch which avoids copies when passed ByteBuffers that are array-backed.

          Show
          Todd Lipcon added a comment - Improved patch which avoids copies when passed ByteBuffers that are array-backed.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12485453/hadoop-7444.txt
          against trunk revision 1143491.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/702//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/702//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/702//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12485453/hadoop-7444.txt against trunk revision 1143491. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/702//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/702//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/702//console This message is automatically generated.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Looks good. Some comments.

          • Why cast a long to an int and then put the int in a long?
                  long calculated = (int)summer.getValue();
            

            It won't work for 64-bit checksums.

          • In line 357, calculated is redundant.
                    long calculated = (int)summer.getValue();
                    checksums.putInt((int)calculated);
            
          • Line 382 to 385, & 0xff are redundant.
                  sums[sumsOffset++] = (byte) ((calculated >> 24) & 0xff);
                  sums[sumsOffset++] = (byte) ((calculated >> 16) & 0xff);
                  sums[sumsOffset++] = (byte) ((calculated >> 8) & 0xff);
                  sums[sumsOffset++] = (byte) ((calculated) & 0xff);
            
          • I would avoid using Math.min(..).
          Show
          Tsz Wo Nicholas Sze added a comment - Looks good. Some comments. Why cast a long to an int and then put the int in a long? long calculated = ( int )summer.getValue(); It won't work for 64-bit checksums. In line 357, calculated is redundant. long calculated = ( int )summer.getValue(); checksums.putInt(( int )calculated); Line 382 to 385, & 0xff are redundant. sums[sumsOffset++] = ( byte ) ((calculated >> 24) & 0xff); sums[sumsOffset++] = ( byte ) ((calculated >> 16) & 0xff); sums[sumsOffset++] = ( byte ) ((calculated >> 8) & 0xff); sums[sumsOffset++] = ( byte ) ((calculated) & 0xff); I would avoid using Math.min(..) .
          Hide
          Todd Lipcon added a comment -
          Show
          Todd Lipcon added a comment - fixed the redundant casts and bitmasks, thanks left the Math.min call, since it actually is a VM intrinsic in hotspot: http://google.com/codesearch#NZZBPLyUxqs/src/share/vm/opto/library_call.cpp&q=hotspot%20intrinsics%20min&type=cs&l=1499
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12486254/hadoop-7444.txt
          against trunk revision 1145839.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/721//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/721//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/721//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12486254/hadoop-7444.txt against trunk revision 1145839. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/721//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HADOOP-Build/721//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/721//console This message is automatically generated.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          +1 patch looks good. It won't work for 64-bit checksum. I think it is fine, provided that the existing code may also not work for 64-bit checksums.

          Show
          Tsz Wo Nicholas Sze added a comment - +1 patch looks good. It won't work for 64-bit checksum. I think it is fine, provided that the existing code may also not work for 64-bit checksums.
          Hide
          Todd Lipcon added a comment -

          Committed to trunk, thanks for review, Nicholas. I agree that the current code in DataChecksum is already specific to 32-bit checksums.

          Show
          Todd Lipcon added a comment - Committed to trunk, thanks for review, Nicholas. I agree that the current code in DataChecksum is already specific to 32-bit checksums.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk-Commit #688 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/688/)
          HADOOP-7444. Add Checksum API to verify and calculate checksums "in bulk". Contributed by Todd Lipcon.

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146111
          Files :

          • /hadoop/common/trunk/common/CHANGES.txt
          • /hadoop/common/trunk/common/src/test/core/org/apache/hadoop/util/TestDataChecksum.java
          • /hadoop/common/trunk/common/src/java/org/apache/hadoop/util/DataChecksum.java
          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk-Commit #688 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/688/ ) HADOOP-7444 . Add Checksum API to verify and calculate checksums "in bulk". Contributed by Todd Lipcon. todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146111 Files : /hadoop/common/trunk/common/CHANGES.txt /hadoop/common/trunk/common/src/test/core/org/apache/hadoop/util/TestDataChecksum.java /hadoop/common/trunk/common/src/java/org/apache/hadoop/util/DataChecksum.java
          Hide
          Jason Rutherglen added a comment -

          It would be great if DFSClient (or some other class) offered an API to verify a file manually. Right now DFSClient offers the getFileChecksum method, though there's no way to verify the CRC against the file (eg, I think it's all under the hood right now).

          Show
          Jason Rutherglen added a comment - It would be great if DFSClient (or some other class) offered an API to verify a file manually. Right now DFSClient offers the getFileChecksum method, though there's no way to verify the CRC against the file (eg, I think it's all under the hood right now).
          Hide
          Doug Cutting added a comment -

          Copying a file to /dev/null will verify it.

          Show
          Doug Cutting added a comment - Copying a file to /dev/null will verify it.
          Hide
          Jason Rutherglen added a comment -

          @Doug Right, basically read the stream until completion. I guess that's one way to do it! I was thinking a method that returned a boolean, like verifyFile(Path).

          Show
          Jason Rutherglen added a comment - @Doug Right, basically read the stream until completion. I guess that's one way to do it! I was thinking a method that returned a boolean, like verifyFile(Path).
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #746 (See https://builds.apache.org/job/Hadoop-Common-trunk/746/)
          HADOOP-7444. Add Checksum API to verify and calculate checksums "in bulk". Contributed by Todd Lipcon.

          todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146111
          Files :

          • /hadoop/common/trunk/common/CHANGES.txt
          • /hadoop/common/trunk/common/src/test/core/org/apache/hadoop/util/TestDataChecksum.java
          • /hadoop/common/trunk/common/src/java/org/apache/hadoop/util/DataChecksum.java
          Show
          Hudson added a comment - Integrated in Hadoop-Common-trunk #746 (See https://builds.apache.org/job/Hadoop-Common-trunk/746/ ) HADOOP-7444 . Add Checksum API to verify and calculate checksums "in bulk". Contributed by Todd Lipcon. todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1146111 Files : /hadoop/common/trunk/common/CHANGES.txt /hadoop/common/trunk/common/src/test/core/org/apache/hadoop/util/TestDataChecksum.java /hadoop/common/trunk/common/src/java/org/apache/hadoop/util/DataChecksum.java

            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development