It turns out this is actually fairly difficult. The reason is that checksumming is done at the DFSOutputStream layer rather than the DataStreamer layer, so the checksum algorithm and chunk size need to be known before the output stream connects to the datanode.
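To illustrate the constraint, here is a minimal sketch (class and method names are illustrative, not the actual HDFS classes): the checksum algorithm and chunk size are fixed at stream construction, before any datanode connection exists, so they cannot simply be changed once append() discovers what the existing block uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical stand-in for DFSOutputStream: the chunk size (and, in the
// real code, the checksum algorithm) is a constructor argument, so it must
// be decided before the stream ever talks to a datanode.
class ChunkedChecksumStream {
    private final int bytesPerChecksum;                 // chunk size, fixed at construction
    private final List<Long> chunkChecksums = new ArrayList<>();

    ChunkedChecksumStream(int bytesPerChecksum) {
        this.bytesPerChecksum = bytesPerChecksum;
    }

    // Checksums data one chunk at a time; a real stream would queue each
    // chunk plus its checksum as a packet for the streamer thread.
    void write(byte[] data) {
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            chunkChecksums.add(crc.getValue());
        }
    }

    List<Long> checksums() { return chunkChecksums; }
}
```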
Here are a few possible solutions:
1) When append() is called, make an RPC to the datanode hosting the last block of the file. This RPC reads the block's meta file header and returns the checksum type and chunk size in use. The DFSOutputStream then adopts that checksum.
- Pro: fairly simple to implement.
- Pro: allows switching both the checksum type and the chunk size.
- Con: extra round-trip to set up the pipeline for append.
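As a rough sketch of what the client side of option 1 might do with the RPC reply: parse out the checksum type and chunk size, then construct the output stream with those values. The wire layout assumed here (a type byte followed by a 4-byte big-endian chunk size) is illustrative only, not the actual meta header format.

```java
import java.nio.ByteBuffer;

// Hypothetical parse of the reply from a "get last block's checksum" RPC.
// Field layout is an assumption for illustration, not the real format.
class ChecksumHeader {
    final byte type;             // e.g. 1 = CRC32, 2 = CRC32C
    final int bytesPerChecksum;  // chunk size the last block was written with

    ChecksumHeader(byte type, int bytesPerChecksum) {
        this.type = type;
        this.bytesPerChecksum = bytesPerChecksum;
    }

    static ChecksumHeader parse(byte[] reply) {
        ByteBuffer buf = ByteBuffer.wrap(reply);
        byte type = buf.get();       // checksum algorithm identifier
        int bpc = buf.getInt();      // bytes per checksum (chunk size)
        return new ChecksumHeader(type, bpc);
    }
}
```

The appending DFSOutputStream would then be constructed with these values instead of the client's configured defaults.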
2) In the case of append, the DN can allow a writer to use a different checksum algorithm so long as the chunk size and checksum size are the same. In this case, it will verify the incoming packets using the writer's algorithm, then re-checksum them using the disk algorithm before writing to the meta file.
- Pro: no extra round-trip on pipeline creation.
- Pro: no need to change client code.
- Pro: when the client transitions to the next block of a file being appended, the new (preferred) checksum is used.
- Con: slight performance hit while filling up the last block of a file being appended, since each chunk is checksummed twice.
- Con: not a general solution (it only supports changing the polynomial, not the chunk size).
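The verify-then-rechecksum step in option 2 could look roughly like this. CRC32 stands in for the writer's polynomial and CRC32C for the on-disk one; the real HDFS checksum types and error handling differ, so this is a sketch of the idea, not the implementation. Note both produce a 4-byte checksum, which is why the chunk size and checksum size constraint makes this workable.

```java
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Illustrative sketch: the datanode verifies an incoming chunk with the
// writer's algorithm, then recomputes the checksum with the disk algorithm
// before writing it to the block's meta file.
class RechecksumOnWrite {
    static long verifyAndRechecksum(byte[] chunk, long writerCrc) {
        // 1) Verify the packet using the writer's algorithm (CRC32 here).
        Checksum writer = new CRC32();
        writer.update(chunk, 0, chunk.length);
        if (writer.getValue() != writerCrc) {
            throw new IllegalArgumentException("checksum mismatch from writer");
        }
        // 2) Re-checksum with the on-disk algorithm (CRC32C here): same
        //    4-byte checksum size, different polynomial.
        Checksum disk = new CRC32C();
        disk.update(chunk, 0, chunk.length);
        return disk.getValue();   // value to store in the meta file
    }
}
```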
Any other ideas?