> We use these 6.25MB second level CRCs as the checksum of the entire file.
Why not just use the MD5 or SHA1 of the CRCs?
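For concreteness, "MD5 of the CRCs" could be as simple as the sketch below. The int[] of per-chunk CRC-32 values and the class name are made up for illustration; the real CRCs would come from the datanode's checksum metadata.

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;

    // Sketch: fold a sequence of per-chunk CRC-32 values into one MD5.
    public class CrcDigest {
      public static byte[] md5OfCrcs(int[] chunkCrcs) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        ByteBuffer buf = ByteBuffer.allocate(4 * chunkCrcs.length);
        for (int crc : chunkCrcs) {
          buf.putInt(crc);        // big-endian, 4 bytes per CRC
        }
        md5.update(buf.array());
        return md5.digest();      // 16 bytes, independent of file size
      }
    }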
When should we compute checksums? Are they computed on demand, when someone calls FileSystem#getFileChecksum()? Or are they pre-computed and stored? If they're not pre-computed, then we certainly ought to compute them from the CRCs. Even if they are pre-computed, we might still derive them from the CRCs, to reduce FileSystem upgrade time.
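From the caller's side, the on-demand model would look roughly like this, assuming FileSystem#getFileChecksum() is the entry point mentioned above; nothing is stored ahead of time, and the checksum is computed when the call is made:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of an on-demand caller; may return null if the
    // filesystem doesn't support checksums.
    public class GetChecksum {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileChecksum sum = fs.getFileChecksum(new Path(args[0]));
        if (sum != null) {
          System.out.println(sum.getAlgorithmName()
              + " (" + sum.getLength() + " bytes)");
        }
      }
    }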
If checksums were pre-computed, where would they be stored? We could store them in the NameNode, with file metadata, or we could store per-block checksums on datanodes.
My hunch is that we should compute them on demand from CRC data. We could extend ClientDatanodeProtocol with a getChecksum() operation that returns the checksum for a block without transmitting the CRCs to the client, and the client would combine the block checksums into a whole-file checksum. This is rather expensive, but still a lot faster than checksumming the entire file on demand. DistCp would be substantially faster if it only compared checksums when file lengths match, so we should probably make that optimization.
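The client-side combination could be a second-level digest over the block checksums, something like the hypothetical sketch below. The getChecksum() call on ClientDatanodeProtocol is the proposed addition and isn't shown; here the per-block checksums are assumed to be already in hand, in file order, so the CRC data itself never leaves the datanode.

    import java.security.MessageDigest;
    import java.util.List;

    // Hypothetical: hash the block checksums in file order to get a
    // whole-file checksum without moving CRC data to the client.
    public class WholeFileChecksum {
      public static byte[] combine(List<byte[]> blockChecksumsInOrder)
          throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (byte[] blockSum : blockChecksumsInOrder) {
          md5.update(blockSum);
        }
        return md5.digest();
      }
    }

For DistCp, the length comparison would come first; only on a length match would these per-block checksums be requested at all.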
Longer-term we could think about a checksum API that returns a sequence of checksums per file, so that, e.g., if a source file has been appended to, we could truncate the destination at the last matching checksum and append only the new data, updating it incrementally. But until HDFS supports truncation this is moot.
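If we ever get there, the interesting part is just finding how much of the destination is still valid. A speculative sketch, with everything hypothetical since no such API exists:

    import java.util.Arrays;
    import java.util.List;

    // Given per-range checksums for source and destination, count the
    // leading ranges that already match; the destination would then be
    // truncated after that prefix and the remaining source ranges
    // appended. Usable only once HDFS can truncate.
    public class IncrementalCopy {
      public static int matchingPrefix(List<byte[]> src, List<byte[]> dst) {
        int n = Math.min(src.size(), dst.size());
        int i = 0;
        while (i < n && Arrays.equals(src.get(i), dst.get(i))) {
          i++;
        }
        return i;
      }
    }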