Hadoop Common
HADOOP-3981

Need a distributed file checksum algorithm for HDFS

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags:
      Incompatible change, Reviewed
    • Release Note:
      Implemented MD5-of-xxxMD5-of-yyyCRC32 which is a distributed file checksum algorithm for HDFS, where xxx is the number of CRCs per block and yyy is the number of bytes per CRC.

      Changed DistCp to use file checksum for comparing files if both source and destination FileSystem(s) support getFileChecksum(...).

      Description

      Traditional message digest algorithms, like MD5, SHA-1, etc., require reading the entire input message sequentially in a central location. HDFS supports large files of multiple terabytes, so the overhead of reading an entire file is huge. A distributed file checksum algorithm is needed for HDFS.
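      A minimal single-process sketch of the layered scheme that was eventually adopted (an MD5 of per-block MD5s of per-chunk CRC-32s). The chunk and block sizes below are illustrative parameters, not HDFS defaults, and the helper name is hypothetical:

```python
import hashlib
import zlib

def file_checksum(data: bytes, bytes_per_crc: int = 512, crcs_per_block: int = 4) -> str:
    """Toy, sequential model of the MD5-of-xxxMD5-of-yyyCRC32 idea."""
    block_size = bytes_per_crc * crcs_per_block
    block_md5s = []
    for b in range(0, len(data), block_size):
        block = data[b:b + block_size]
        # First level: one CRC-32 per bytes_per_crc-sized chunk of the block.
        crcs = b"".join(
            zlib.crc32(block[i:i + bytes_per_crc]).to_bytes(4, "big")
            for i in range(0, len(block), bytes_per_crc)
        )
        # Second level: MD5 over this block's CRCs (done per datanode in HDFS).
        block_md5s.append(hashlib.md5(crcs).digest())
    # Final level: MD5 over the block MD5s (combined at the client in HDFS).
    return hashlib.md5(b"".join(block_md5s)).hexdigest()
```

      Because each block's MD5 depends only on that block's CRCs, the per-block work can run independently on the datanodes that already store the CRCs, which is what makes the algorithm distributable.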

      1. 3981_20080909.patch
        22 kB
        Tsz Wo Nicholas Sze
      2. 3981_20080910.patch
        29 kB
        Tsz Wo Nicholas Sze
      3. 3981_20080910b.patch
        29 kB
        Tsz Wo Nicholas Sze
      4. 3981_20080912.patch
        28 kB
        Tsz Wo Nicholas Sze

          Activity

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #55 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/55/)
          Delete the empty file src/core/org/apache/hadoop/fs/LengthFileChecksum.java which was removed by .

          Owen O'Malley made changes -
          Component/s dfs [ 12310710 ]
          Nigel Daley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hudson added a comment -

          Integrated in Hadoop-trunk #611 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/)
          Tsz Wo Nicholas Sze made changes -
          Link This issue is related to HADOOP-4197 [ HADOOP-4197 ]
          Tsz Wo Nicholas Sze made changes -
          Hadoop Flags [Reviewed] [Incompatible change, Reviewed]
          Tsz Wo Nicholas Sze added a comment -

          This is an incompatible change since the DATA_TRANSFER_VERSION should be changed. See HADOOP-4197.

          Hudson added a comment -

          Integrated in Hadoop-trunk #605 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/605/)
          Tsz Wo Nicholas Sze made changes -
          Link This issue is related to HADOOP-4176 [ HADOOP-4176 ]
          Tsz Wo Nicholas Sze made changes -
          Resolution Fixed [ 1 ]
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Fix Version/s 0.19.0 [ 12313211 ]
          Tsz Wo Nicholas Sze added a comment -

          I just committed this.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12390040/3981_20080912.patch
          against trunk revision 695690.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          In case Hudson is not working, here are the ant test-patch results:

               [exec] +1 overall.
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
          
          Tsz Wo Nicholas Sze made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Tsz Wo Nicholas Sze added a comment -

          Passed all tests locally, submitting...

          Tsz Wo Nicholas Sze made changes -
          Attachment 3981_20080912.patch [ 12390040 ]
          Tsz Wo Nicholas Sze added a comment -

          3981_20080912.patch: fixed a bug.

          Tsz Wo Nicholas Sze made changes -
          Release Note Implemented MD5-of-xxxMD5-of-yyyCRC32 which is a distributed file checksum algorithm for HDFS, where xxx is the number of CRCs per block and yyy is the number of bytes per CRC.

          Changed DistCp to use file checksum for comparing files if both source and destination FileSystem(s) support getFileChecksum(...).
          Tsz Wo Nicholas Sze made changes -
          Attachment 3981_20080910b.patch [ 12389892 ]
          Tsz Wo Nicholas Sze added a comment -

          3981_20080910b.patch:

          Regarding Raghu's comments:
          1. Removed.
          2 & 3. Changed the messages.
          4. Moved toByteArray() to WritableUtils.
          5. Changed the buffer to 4kB.

          Also fixed some bugs.

          Raghu Angadi added a comment -

          Mostly minor:

          1. DFSClient: sock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE); is not required.
          2. DFSClient: LOG.warn("src=" + src, ie); is pretty vague. It could include the datanode as well.
          3. DFSClient: IOException("Fail to get block MD5 for block (=" + block) could be IOException("Failed to get block MD5 for " + block).
          4. I don't see much need for toByteArray(): -0.25. At least it could move to some utils.
          5. The 1MB buffer for MD5Hash.digest() seems too big. Any CPU benefits with MD5? This is another utility function written as a member.
          Raghu Angadi made changes -
          Assignee Tsz Wo (Nicholas), SZE [ szetszwo ]
          Doug Cutting added a comment -

          +1 This looks good to me.

          Tsz Wo Nicholas Sze made changes -
          Attachment 3981_20080910.patch [ 12389862 ]
          Tsz Wo Nicholas Sze added a comment -

          3981_20080910.patch:

          > It looks like you forgot to include the class MD5MD5CRC32FileChecksum in the patch.
          added MD5MD5CRC32FileChecksum

          > Why do you use the datanode's socket/opcode interface rather than adding a method to ClientDatanodeProtocol?
          There might be an RPC timeout problem, so I use the data transfer protocol instead of ClientDatanodeProtocol.

          > WritableUtils#toByteArray can use io.DataOutputBuffer, no?
          Moved WritableUtils#toByteArray to io.DataOutputBuffer and implemented it with DataOutputBuffer.

          > DistCp#sameFile() should be changed to only get checksums when the lengths match.
          Updated DistCp

          Raghu Angadi added a comment -

          > Why do you use the datanode's socket/opcode interface rather than adding a method to ClientDatanodeProtocol?

          Nicholas had briefly talked to me regarding this. I was OK with either way. If RPCs are used, then other RPCs on the port should be prepared to handle delays on the order of minutes, since these checksum RPCs compete with the rest of the disk accesses. And there could be quite a few of these requests.

          The Datanode has just 3 RPC handlers; we probably should not increase the handlers for this reason, since checksum load would be very rare and the DataNode is thread-starved already.

          Doug Cutting added a comment -

          Some comments:

          • It looks like you forgot to include the class MD5MD5CRC32FileChecksum in the patch.
          • Why do you use the datanode's socket/opcode interface rather than adding a method to ClientDatanodeProtocol?
          • WritableUtils#toByteArray can use io.DataOutputBuffer, no?
          • DistCp#sameFile() should be changed to only get checksums when the lengths match.
          Tsz Wo Nicholas Sze made changes -
          Attachment 3981_20080909.patch [ 12389788 ]
          Tsz Wo Nicholas Sze added a comment -

          3981_20080909.patch: implement MD5-of-xxxMD5-of-yyyCRC, where xxx is crc per block and yyy is bytes per crc. Need to fix some bugs.

          Doug Cutting added a comment -

          > implement MD5-of-CRC32-every-512bytes-with-64Mblocks and use it as the default file checksum algorithm for all FileSystem?

          We should just implement this for HDFS, where CRCs already exist.

          > Should we return a list of MD5 [ ... ] ?

          No, just a single checksum for the entire file.

          > If bytes.per.checksum is not 512 or block size is not 64MB in HDFS, how about getFileChecksum(Path f) returns null?

          No, it should return a different algorithm string, with the file's bytes.per.checksum and block size. I now think returning null by default is probably best, rather than having a default implementation that uses file length, since we should check file lengths explicitly and only compare checksums when lengths match.

          > For other FS, how about we return null?

          Yes, I agree.

          Tsz Wo Nicholas Sze added a comment -

          How about we implement MD5-of-CRC32-every-512bytes-with-64Mblocks and use it as the default file checksum algorithm for all FileSystem? Then, we don't have to change FileSystem API at this moment.

          A few issues:

          • Should we return a list of MD5 (one per block, the length of the file checksum will depend on the number of blocks) or a fixed length checksum (e.g. MD5-of-MD5-of-CRC32)?
          • If bytes.per.checksum is not 512 or block size is not 64MB in HDFS, how about getFileChecksum(Path f) returns null?
          • For other FS, how about we return null? We should implement a serial version of MD5-of-CRC32-every-512bytes-with-64Mblocks algorithm later.
          Doug Cutting added a comment -

          > Which API are you talking about, FileSystem API or HDFS API?

          FileSystem. And that second method above should have been:

          • Checksum getChecksum(Path, String algorithm)

          But this is probably overkill for now. Let's just choose a single algorithm and put its parameters in the algorithm name for now. We have to change the ClientDatanodeProtocol in any case. We can either compute per-block checksums on the datanode, or send CRCs to the client and sum them there. Let's just pick one or the other though for the first version. My preference would be to compute per-block checksums on the datanode, but I don't feel that strongly and would not veto the other approach.

          Tsz Wo Nicholas Sze added a comment -

          > Perhaps we can use bytes.per.checksum in the algorithm name, e.g., MD5-of-CRC32-every-512bytes

          +1 We definitely need to encode these details in the algorithm name.

          > Another API to consider ...

          Which API are you talking about, FileSystem API or HDFS API? If you mean HDFS API, are you saying that we should handle HDFS specially in DistCp? Currently, DistCp only uses FileSystem API.

          Doug Cutting added a comment -

          > Otherwise the file checksum computed depends on the block size.

          It still depends on bytes.per.checksum, which can vary per file, just like block size. If two files have different bytes.per.checksum then we should not compare CRC-derived checksums. Perhaps we can use bytes.per.checksum in the algorithm name, e.g., MD5-of-CRC32-every-512bytes could be an algorithm name. If we compute these per-block, then the algorithm name would be MD5-of-CRC32-every-512bytes-with-64Mblocks.

          If we compute checksums on demand from CRCs then it will be relatively slow. Distcp thus needs to be sure to only get checksums when lengths match and the alternative is copying the entire file. So long as distcp is the primary client of checksums this is probably sufficient and we should not bother storing checksums.
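          To make the naming idea concrete, here is a hypothetical helper that encodes the parameters in the algorithm name (the final release note uses the closely related form MD5-of-xxxMD5-of-yyyCRC32; this function is illustrative only):

```python
def algorithm_name(crcs_per_block: int, bytes_per_crc: int) -> str:
    # Hypothetical helper: two files' checksums are comparable only if their
    # algorithm names (and therefore their parameters) are identical.
    return f"MD5-of-{crcs_per_block}MD5-of-{bytes_per_crc}CRC32"

# 64MB blocks with a CRC every 512 bytes -> 131072 CRCs per block.
print(algorithm_name(64 * 1024 * 1024 // 512, 512))
```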

          Another API to consider might be:

          • String[] getChecksumAlgorithms(Path)
          • Checksum getChecksum(Path)

          This way an HDFS filesystem might return ["MD5-of-CRC32-every-512bytes-with-64Mblocks", "MD5-of-CRC32-every-512bytes", "MD5"], the possible algorithms for a file in preferred order. Then DistCp could call this for two files (whose lengths match) to see if they have any compatible algorithms. If possible, CRCs would be combined on datanodes, but, if block sizes differ, the CRCs could be summed in the client. If the CRCs are incompatible, then MD5s could be computed on datanodes. Is this overkill? Probably.

          Tsz Wo Nicholas Sze added a comment -

          > When should we compute checksums? Are they computed on demand, when someone calls FileSystem#getFileChecksum()? Or are they pre-computed and stored? If they're not pre-computed then we certainly ought to compute them from the CRC's. Even if they are to be pre-computed, then we might still use the CRCs, to reduce FileSystem upgrade time.

          It is better to compute file checksum on-demand, so that the Datanode storage layout remains unchanged and we won't have to do distributed upgrade.

          > My hunch is that we should compute them on demand from CRC data. We extend ClientDatanodeProtocol to add a getChecksum() operation that returns the checksum for a block without transmitting the CRCs to the client, and the client combines block checksums to get a whole-file checksum. This is rather expensive, but still a lot faster than checksumming the entire file on demand.

          My idea is similar to this except that we should not compute block checksum. Otherwise the file checksum computed depends on the block size. That is the reason that I propose to compute the second level CRCs over the first level CRCs. This idea is borrowed from hash tree (aka Merkle Trees), which is used by ZFS.

          Tsz Wo Nicholas Sze added a comment -

          > Why not just use the MD5 or SHA1 of the CRCs?

          MD5 requires sequential access to the data. One easy implementation of MD5-over-CRCs is that the client reads all CRCs from the Datanodes and then computes the MD5 over them. However, this requires reading all the first-level CRCs, which is 800MB for a 100GB file. Is that too much network traffic?

          Raghu has a very good idea for another implementation, which computes the MD5 across datanodes as follows: the client asks Datanode 1 (which has the first block) to compute the MD5. Datanode 1 returns the intermediate state of the MD5 computation to the client, and the client sends that intermediate state to Datanode 2 (which has the second block). Datanode 2 then continues the MD5 computation and returns its intermediate state to the client, and so on.

          Note that this is a distributed algorithm but not a parallel one. Another problem for MD5 in this implementation is that there is no easy way to get the intermediate state of an MD5 computation in Java 1.6.

          > It is more appealing to have a small, fixed size checksum.

          This is probably good. I will think about this.

          Raghu Angadi added a comment -

          > Why not just use the MD5 or SHA1 of the CRCs?

          +1. It is more appealing to have a small, fixed size checksum.

          Issues regarding differences in bytes.per.checksum, block size, etc. are essentially the same either with a fixed-length checksum or with the 2nd-level checksum that Nicholas proposed.

          Doug Cutting added a comment -

          > We use these 6.25MB second level CRCs as the checksum of the entire file.

          Why not just use the MD5 or SHA1 of the CRCs?

          When should we compute checksums? Are they computed on demand, when someone calls FileSystem#getFileChecksum()? Or are they pre-computed and stored? If they're not pre-computed then we certainly ought to compute them from the CRC's. Even if they are to be pre-computed, then we might still use the CRCs, to reduce FileSystem upgrade time.

          If checksums were pre-computed, where would they be stored? We could store them in the NameNode, with file metadata, or we could store per-block checksums on datanodes.

          My hunch is that we should compute them on demand from CRC data. We extend ClientDatanodeProtocol to add a getChecksum() operation that returns the checksum for a block without transmitting the CRCs to the client, and the client combines block checksums to get a whole-file checksum. This is rather expensive, but still a lot faster than checksumming the entire file on demand. DistCp would be substantially faster if it only used checksums when file lengths match, so we should probably make that optimization.

          Longer-term we could think about a checksum API that permits a sequence of checksums to be returned per file, so that, e.g., if a source file has been appended to, we could truncate the destination and append the new data, incrementally updating it. But until HDFS supports truncation this is moot.

          Tsz Wo Nicholas Sze added a comment -

          Currently, Datanode stores a CRC-32 for every 512-byte chunk. Let's call these CRCs the first level CRC. So the total size for the first level CRC is about 1/128 of the data size.

          How about we compute a second level of checksum over the first-level CRCs? So, for every 512 bytes of first-level CRCs, we compute a CRC-32. Then, the second-level CRCs are about 1/16384 of the data size. We could use these second-level CRCs as the checksum of the file.

          For example, if a file has 100GB, the size of first level CRCs is 800MB and the size of the second level CRCs is only 6.25MB. We use these 6.25MB second level CRCs as the checksum of the entire file.
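          The arithmetic above follows from 4 bytes of CRC-32 per 512 bytes of input at each level (a 1/128 ratio per level); a quick check:

```python
GB = 1024 ** 3
MB = 1024 ** 2

file_size = 100 * GB
# First level: one 4-byte CRC-32 per 512-byte data chunk (4/512 = 1/128 of the data).
first_level = file_size // 512 * 4
# Second level: one 4-byte CRC-32 per 512 bytes of first-level CRCs (1/16384 of the data).
second_level = first_level // 512 * 4

print(first_level // MB, "MB of first-level CRCs")   # 800 MB
print(second_level / MB, "MB of second-level CRCs")  # 6.25 MB
```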

          Tsz Wo Nicholas Sze added a comment -

          > One simple approach could be checksum of checksums/metadata.

          +1 We probably should compute a hash tree or MD5 over the existing checksums.

          Lohit Vijayarenu added a comment -

          One simple approach could be checksum of checksums/metadata.

          Tsz Wo Nicholas Sze made changes -
          Field Original Value New Value
          Link This issue is blocked by HADOOP-3941 [ HADOOP-3941 ]
          Tsz Wo Nicholas Sze created issue -

            People

             • Assignee: Tsz Wo Nicholas Sze
             • Reporter: Tsz Wo Nicholas Sze
             • Votes: 0
             • Watchers: 5