|
[
Permlink
| « Hide
]
Lohit Vijayarenu added a comment - 20/Aug/08 06:10 PM
One simple approach could be checksum of checksums/metadata.
> One simple approach could be checksum of checksums/metadata.
+1 we probably should compute a hash tree or md5 over the existing checksums. Currently, Datanode stores a CRC-32 for every 512-byte chunk. Let's call these CRCs the first level CRC. So the total size for the first level CRC is about 1/128 of the data size.
How about we compute a second level of checksum over the first level CRCs? So, for every 512-byte first level CRC, we compute a CRC-32. Then, the second level CRC is about 1/16384 of the data size. We could use these second level CRCs as the checksum of the file. For example, if a file has 100GB, the size of first level CRCs is 800MB and the size of the second level CRCs is only 6.25MB. We use these 6.25MB second level CRCs as the checksum of the entire file. > We use these 6.25MB second level CRCs as the checksum of the entire file.
Why not just use the MD5 or SHA1 of the CRCs? When should we compute checksums? Are they computed on demand, when someone calls FileSystem#getFileChecksum()? Or are they pre-computed and stored? If they're not pre-computed then we certainly ought to compute them from the CRC's. Even if they are to be pre-computed, then we might still use the CRCs, to reduce FileSystem upgrade time. If checksums were pre-computed, where would they be stored? We could store them in the NameNode, with file metadata, or we could store per-block checksums on datanodes. My hunch is that we should compute them on demand from CRC data. We extend ClientDatanodeProtocol to add a getChecksum() operation that returns the checksum for a block without transmitting the CRCs to the client, and the client combines block checksums to get a whole-file checksum. This is rather expensive, but still a lot faster than checksumming the entire file on demand. DistCp would be substantially faster if it only used checksums when file lengths match, so we should probably make that optimization. Longer-term we could think about a checksum API that permits a sequence of checksums to be returned per file, so that, e.g., if a source file has been appended to, we could truncate the destination and append the new data, incrementally updating it. But until HDFS supports truncation this is moot. > Why not just use the MD5 or SHA1 of the CRCs?
+1. It is more appealing to have a small, fixed size checksum. Issues regd differences in bytes.per.checksum, blockSize etc are essentially same either with fixed length checksum or with 2nd level checksum that Nicholas proposed.
MD5 requires sequential access of data. One easy implementation of MD5-over-CRCs is that client read all CRCs from Datanodes and then compute MD5 over them. However, it requires reading all the first level CRCs, which is 800MB for a 100GB file. Is it too much network traffic? Raghu has a very good idea for another implementation, which computes MD5 across datanodes as follow: Client initiates the Datanode 1 (which has the first block) to compute MD5. Datanode 1 returns the intermediate status of MD5 computation to the Client and the Client send the intermediate states to Datanode 2 (which has the second block). Then, the Datanode 2 continues the MD5 computation and return the MD5 computation intermediate status to the Client, and so on. Note that this is not a parallel algorithm although it is a distributed algorithm. Another problem for MD5 in this implemenation is that there is no easy way to get the MD5 computation intermediate status in Java 1.6.
This is probably good. I will think about this.
It is better to compute file checksum on-demand, so that the Datanode storage layout remains unchanged and we won't have to do distributed upgrade.
My idea is similar to this except that we should not compute block checksum. Otherwise the file checksum computed depends on the block size. That is the reason that I propose to compute the second level CRCs over the first level CRCs. This idea is borrowed from hash tree (aka Merkle Trees), which is used by ZFS. > Otherwise the file checksum computed depends on the block size.
It still depends on bytes.per.checksum, which can vary per file, just like block size. If two files have different bytes.per.checksum then we should not compare CRC-derived checksums. Perhaps we can use bytes.per.checksum in the algorithm name, e.g., MD5-of-CRC32-every-512bytes could be an algorithm name. If we compute these per-block, then the algorithm name would be MD5-of-CRC32-every-512bytes-with-64Mblocks. If we compute checksums on demand from CRCs then it will be relatively slow. Distcp thus needs to be sure to only get checksums when lengths match and the alternative is copying the entire file. So long as distcp is the primary client of checksums this is probably sufficient and we should not bother storing checksums. Another API to consider might be:
This way an HDFS filesystem might return ["MD5-of-CRC32-every-512bytes-with-64Mblocks", "MD5-of-CRC32-every-512bytes", "MD5"] the possible algorithms for a file in preferred order. Then Distcp could call this for two files (whose lengths match) to see if they have any compatible algorithms. If possible, CRC's would be combined on datanodes, but, if block sizes differ, the CRCs could be summed in the client. If the CRCs are incompatible, then MD5s could be computed on datanodes. Is this overkill? Probably. > Perhaps we can use bytes.per.checksum in the algorithm name, e.g., MD5-of-CRC32-every-512bytes
+1 We definitely need to encode these details in the algorithm name. > Another API to consider ... Which API are you talking about, FileSystem API or HDFS API? If you mean HDFS API, are you saying that we should handle HDFS specially in DistCp? Currently, DistCp only uses FileSystem API. > Which API are you talking about, FileSystem API or HDFS API?
FileSystem. And that second method above should have been:
But this is probably overkill for now. Let's just choose a single algorithm and put its parameters in the algorithm name for now. We have to change the ClientDatanodeProtocol in any case. We can either compute per-block checksums on the datanode, or send CRCs to the client and sum them there. Let's just pick one or the other though for the first version. My preference would be to compute per-block checksums on the datanode, but I don't feel that strongly and would not veto the other approach. How about we implement MD5-of-CRC32-every-512bytes-with-64Mblocks and use it as the default file checksum algorithm for all FileSystem? Then, we don't have to change FileSystem API at this moment.
A few issues:
> implement MD5-of-CRC32-every-512bytes-with-64Mblocks and use it as the default file checksum algorithm for all FileSystem?
We should just implement this for HDFS, where CRCs already exist. > Should we return a list of MD5 [ ... ] ? No, just a single checksum for the entire file. > If bytes.per.checksum is not 512 or block size is not 64MB in HDFS, how about getFileChecksum(Path f) returns null? No, it should return a different algorithm string, with the file's bytes.per.checksum and block size. I now think returning null by default is probably best, rather than having a default implementation that uses file length, since we should check file lengths explicitly and only compare checksums when lengths differ. > For other FS, how about we return null? Yes, I agree. 3981_20080909.patch: implement MD5-of-xxxMD5-of-yyyCRC, where xxx is crc per block and yyy is bytes per crc. Need to fix some bugs.
Some comments:
> Why do you use the datanode's socket/opcode interface rather than adding a method to ClientDatanodeProtocol?
Nicholas had briefly talk to me regd this. I was ok with either way. If RPCs are used, then other RPCs on the port should be prepared to handle delays on the order of minutes, since these checksum RPCs compete with the rest of the disk accesses. And there could be quite a few these requests. Datanode has just 3 RPC handlers.. we probably should not increase the handlers for this reason since checksum load would be very rare and DataNode is thread starved already. 3981_20080910.patch:
> It looks like you forgot to include the class MD5MD5CRC32FileChecksum in the patch. > Why do you use the datanode's socket/opcode interface rather than adding a method to ClientDatanodeProtocol? > WritableUtils#toByteArray can use io.DataOutputBuffer, no? > DistCp#sameFile() should be changed to only get checksums when the lengths differ. Mostly minor :
3981_20080910b.patch:
Regarding to Raghu's comments Also fixed some bugs. 3981_20080912.patch: fixed a bug.
Passed all tests locally, submitting...
In case that Hudson is not working, here is the ant test-patch results:
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 6 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12390040/3981_20080912.patch against trunk revision 695690. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/testReport/ This message is automatically generated. I just committed this.
Integrated in Hadoop-trunk #605 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/605/
This is an incompatible change since the DATA_TRANSFER_VERSION should be changed. See
Integrated in Hadoop-trunk #611 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/
Integrated in Hadoop-Common-trunk #55 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/55/
Delete the empty file src/core/org/apache/hadoop/fs/LengthFileChecksum.java which was removed by . |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||