The post-copy check is done by comparing the results from getFileChecksum(). Some tools also use getFileChecksum() to decide whether the destination copy needs to be updated. If a copy with identical content can have a different checksum type than the source, these checks are no longer usable. Staying with CRC32 is a workaround, but this precludes moving to the better-performing CRC32C checksum.
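To illustrate why mixed checksum types break the comparison, here is a minimal sketch using only the JDK's java.util.zip classes (CRC32C needs Java 9 or later). It is not HDFS code: getFileChecksum() returns a composite value built from per-chunk CRCs rather than a raw CRC, and the class and method names below are made up for the example. The point is only that identical bytes produce different values under the two algorithms.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Hypothetical demo class; not part of Hadoop.
class ChecksumMismatchDemo {
    // Computes the checksum of the whole buffer with the given algorithm.
    static long checksumOf(Checksum algo, byte[] data) {
        algo.reset();
        algo.update(data, 0, data.length);
        return algo.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the standard check input for CRC algorithms.
        byte[] content = "123456789".getBytes(StandardCharsets.US_ASCII);
        long crc32 = checksumOf(new CRC32(), content);
        long crc32c = checksumOf(new CRC32C(), content);
        System.out.printf("CRC32  = %08x%n", crc32);   // cbf43926
        System.out.printf("CRC32C = %08x%n", crc32c);  // e3069283
        // Same bytes, different checksum values: a naive comparison
        // would report the copy as different from the source.
        System.out.println("match  = " + (crc32 == crc32c));
    }
}
```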
One of the least invasive approaches is to follow one principle: allow the source's checksum method to be used for the destination in a mixed-checksum environment. If the default is CRC32C, all newly created content will use CRC32C, but existing data with CRC32 will stay with CRC32 even after DistCp. This allows gradual migration to CRC32C.
This approach requires the following capabilities:
- Clients should be able to find out the checksum type of existing data.
- Clients should be able to tell data nodes which checksum type to use for write.
Without append, these operations can be done at the file level. But if append is used, a file can contain more than one checksum type (see HDFS-2130 for details), which forces the above operations to be performed for every block. However, exposing block-level detail is not desirable for the FileSystem abstraction.
I propose we add a configurable feature to make append() follow the existing checksum method. For zero-byte files, the default is used. For non-zero-byte files, checking the first block is sufficient. This information is exposed to clients so that they can use it to specify the write checksum type. There will be additional setup time at the beginning of append(). For this reason, we want to keep the existing append behavior as the default and add this new behavior as an option. Or maybe the other way around.
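The selection rule for append() could look roughly like the sketch below. Everything here is hypothetical and only models the decision: the class, the enum, and the followExisting flag are names invented for illustration, not existing HDFS types or configuration keys.

```java
// Hypothetical model of the proposed append() rule; not HDFS code.
class AppendChecksumPolicy {
    enum ChecksumType { CRC32, CRC32C }

    /**
     * Picks the checksum type to use when appending to a file.
     *
     * @param fileLength        current length of the file in bytes
     * @param firstBlockType    checksum type of the first block
     *                          (ignored for zero-byte files, may be null)
     * @param configuredDefault the client's default checksum type
     * @param followExisting    whether the proposed option is enabled
     */
    static ChecksumType chooseForAppend(long fileLength,
                                        ChecksumType firstBlockType,
                                        ChecksumType configuredDefault,
                                        boolean followExisting) {
        // Option disabled: keep today's behavior and use the default.
        if (!followExisting) {
            return configuredDefault;
        }
        // Zero-byte file: nothing to follow, so the default applies.
        if (fileLength == 0 || firstBlockType == null) {
            return configuredDefault;
        }
        // Non-empty file: the first block decides for the whole file.
        return firstBlockType;
    }

    public static void main(String[] args) {
        System.out.println(chooseForAppend(
            1024L, ChecksumType.CRC32, ChecksumType.CRC32C, true));
    }
}
```

With the option enabled, a non-empty CRC32 file keeps CRC32 on append even when the client default is CRC32C, so the file never ends up with mixed checksum types.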
As for exposing the checksum type information, we may add a getFileChecksum method that returns the checksum and type for the first n bytes of a file. For small n, it requires contacting only one data node. This method can have other uses, such as a quick content version check when the header of the file is guaranteed to be different for different versions.
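A rough, local-only sketch of the "checksum and type for the first n bytes" idea follows. It works on an in-memory byte array with the JDK's CRC classes (CRC32C needs Java 9 or later); prefixChecksum, Result, and Type are hypothetical names, and a real implementation would ask a data node rather than read bytes on the client.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Hypothetical sketch; not the proposed FileSystem API itself.
class PrefixChecksumSketch {
    enum Type { CRC32, CRC32C }

    // Checksum value paired with the algorithm that produced it.
    static final class Result {
        final Type type;
        final long value;

        Result(Type type, long value) {
            this.type = type;
            this.value = value;
        }

        // Values are only comparable when the types agree.
        boolean sameAs(Result other) {
            return type == other.type && value == other.value;
        }
    }

    // Checksums at most the first n bytes of data with the given type.
    static Result prefixChecksum(byte[] data, int n, Type type) {
        Checksum algo = (type == Type.CRC32C) ? new CRC32C() : new CRC32();
        int len = Math.min(n, data.length);
        algo.update(data, 0, len);
        return new Result(type, algo.getValue());
    }

    public static void main(String[] args) {
        // Two versions of a file whose headers are guaranteed to differ.
        byte[] v1 = "v1|payload".getBytes(StandardCharsets.US_ASCII);
        byte[] v2 = "v2|payload".getBytes(StandardCharsets.US_ASCII);
        Result a = prefixChecksum(v1, 3, Type.CRC32C);
        Result b = prefixChecksum(v2, 3, Type.CRC32C);
        System.out.println("same version: " + a.sameAs(b));
    }
}
```

Returning the type alongside the value lets a caller refuse to compare results produced by different algorithms instead of reporting a false mismatch.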
For create/writes, setting dfs.checksum.type works, but with the FileSystem cache on, the checksum type used for creating the FSDataOutputStream won't change. For data copy apps that need to switch the checksum type, fs.<fs name>.impl.disable.cache may be set to true to get a unique instance every time. When dealing with a long list of files, call close() on each instance to avoid bloat and OOM.
This is my rough idea, which I have implemented partially so far. An HDFS subtask may be created, if the changes in common and hdfs are not interdependent. Any feedback is appreciated.