As a follow-up to CRUNCH-660 and CRUNCH-675, a handful of corrections and improvements have been identified during testing.
- We need to preserve preferred part names, e.g. part-m-00000. Currently the DistCp support in Crunch does not make use of the FileTargetImpl#getDestFile method, and therefore creates destination file names like out0-m-00000, which are problematic when multiple map-only jobs write to the same target path. Preserving the names can be achieved by providing a custom CopyListing implementation that dynamically renames target paths based on a given mapping. Unfortunately, a substantial amount of code currently has to be duplicated from the original SimpleCopyListing class in order to inject the logic for modifying the sequence file entry keys. HADOOP-16147 has been opened to allow this to be simplified in the future.
- The handleOutputs implementation in HFileTarget is essentially identical to the one in FileTargetImpl that it overrides. We can remove it and just share the same code.
- It could be useful to add a property for configuring the max DistCp task bandwidth, as the default (100 MB/s per task) may be too high for certain environments.
- The default of 1000 for max DistCp map tasks may be too high in some situations, resulting in 503 Slow Down errors from S3, especially if multiple jobs are writing into the same bucket. Reducing the default to 100 should help prevent such issues while still providing adequate parallel throughput.
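
To illustrate the renaming idea from the first item, here is a minimal sketch of the mapping step in plain Java. It is not the actual Crunch or DistCp code: the class TargetPathRenamer, its rename method, and the shape of the mapping (source file name to preferred file name) are assumptions made for illustration only. A real CopyListing implementation would apply equivalent logic to the relative-path keys it writes to the listing sequence file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Crunch or Hadoop: rewrites the file name
// component of a relative target path according to a supplied mapping.
final class TargetPathRenamer {

    // Assumed mapping: source file name -> preferred destination file name.
    private final Map<String, String> fileNameMapping;

    TargetPathRenamer(Map<String, String> fileNameMapping) {
        this.fileNameMapping = new HashMap<>(fileNameMapping);
    }

    // Returns the path with its last component replaced when a mapping entry
    // exists for it; otherwise returns the path unchanged.
    String rename(String relativePath) {
        int slash = relativePath.lastIndexOf('/');
        String dir = relativePath.substring(0, slash + 1);  // "" when there is no slash
        String name = relativePath.substring(slash + 1);
        String preferred = fileNameMapping.get(name);
        return preferred == null ? relativePath : dir + preferred;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("out0-m-00000", "part-m-00000");
        TargetPathRenamer renamer = new TargetPathRenamer(mapping);

        // A mapped entry is renamed; an unmapped entry is left as-is.
        System.out.println(renamer.rename("/out0-m-00000"));
        System.out.println(renamer.rename("/out0-m-00001"));
    }
}
```

Keeping the mapping lookup separate from the listing code is one way to limit how much of SimpleCopyListing would need to be duplicated; whether that split suits the actual patch is a judgment call.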