Outline of requirements and decisions made:
I will include more details on some of these points as the implementation or discussion continues. I think I can start implementation of some of the parts.
Here DFS refers to Namenode(NN) and Datanodes(DNs), and client refers to
1) Checksums are maintained end to end and maintained in datanodes as metadata for each block file. 4 byte CRC32 is calculated for for configure size of of sub-block data. Default is 4 byte CRC for each 64KB of block data.
2) DN keeps checksums for each blocks as separate files (e.g.: blk_<id>.crc) and includes a header at the front of the file.
3) Checksums are calculated by client while writing and passed on to DN. DN verifies before writing to disk. DN verifies the checksum each time it reads the block data to serve to a client and client verifies it as well.
4) Data transfer protocol between client and datanodes includes inline checksums transmitted along with the data on same TCP connection. Client reads from a different replica when checksum fails from a DN.
5) When DN notices a checksum failure, it informs namenode. Namenode will treat this as a deleted block in initial implementation. Later improvement will delay delation of the block until a new valid replica is created.
6) DistributedFileSystem class will not extend ChecksumFileSystem since checksums will be integral to DFS. We could have a ChecksumDistributedFileSystem if weneed user visible checksums.
7) Upgrade : When DFS is upgraded to this new version, DFS cluster will in safemode until all (or most of) the datanodes upgrade thier local files with checksums. This process is expected to last for couple of hours.
8) Currently each DFS file has associated checksum stored in ".crc" file. During upgrade, datanodes fetch relevant parts of .crc files to verify checksums for each block. Since this involves interaction with namenode, namenode could be busy or even bottleneck for upgrade.
9) We haven't decided how and when to delete .crc files. They could be deleted by a shell script as well.
Future enhancements :
1) Bechmark CPU and I/O overhead of checksums on Datanodes. Most of CPU overhead could be hidden with overlapping with network and disk I/O. Tests showed that java CRC32 takes 5-6 micro seconds for each 1MB of data. Because of the overlap I don't expect any noticeable increase in latency because of CPU. Disk I/O overhead might contribute more for latency.
2) Based on benchmark tests, investigate if in-memory cache can be used for CRC.
3) Datanodes should periodically scan and verify checksums for its blocks.
4) Option to change CRC-block size. E.g. during the upgrade, datanodes maintain 4 byte CRC for every 512 bytes since ".crc" files used 512 byte chunks. We might want to convert them to 4 bytes for every 64K bytes of data.
5) Namenode should delay deletion of a corrupted block until a new replica is created.