1. distcp -update launches a job when the source paths contain at least one directory, even though there is nothing to copy.
HADOOP-5675 added a check for fileCount > 0 to decide whether to launch the job. HADOOP-5762 then changed the check to fileCount + dirCount > 0 so that empty directories are copied to the destination. With -update, however, dirCount is incremented without checking whether the directory already exists at the destination, so the distcp job is launched because dirCount > 0 even though there is nothing to copy. With -update, dirCount should not be incremented for a directory that already exists at the destination.
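The proposed check can be sketched as below. This is a minimal illustration, not DistCp's actual code: the class and method names, the use of plain strings for paths, and the existingDestDirs set standing in for a destination FileSystem lookup are all assumptions made for the example. Only the fileCount + dirCount > 0 condition comes from the description above.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical model of the launch decision; not taken from DistCp. */
public class LaunchDecisionSketch {

    /**
     * Returns true if a copy job should be launched.
     *
     * @param sourceDirs       source directories, as paths relative to the destination root (assumed representation)
     * @param existingDestDirs directories already present at the destination (stand-in for a FileSystem check)
     * @param update           whether -update was given
     * @param fileCount        number of files that still need to be copied
     */
    static boolean shouldLaunchJob(List<String> sourceDirs,
                                   Set<String> existingDestDirs,
                                   boolean update,
                                   long fileCount) {
        long dirCount = 0;
        for (String dir : sourceDirs) {
            // With -update, a directory that already exists at the
            // destination contributes nothing to the work to be done.
            if (update && existingDestDirs.contains(dir)) {
                continue;
            }
            dirCount++;
        }
        return fileCount + dirCount > 0;
    }

    public static void main(String[] args) {
        // One directory, already at the destination, no files to copy.
        System.out.println(shouldLaunchJob(List.of("logs"), Set.of("logs"), true, 0));
        // Same input without -update: the directory is still counted.
        System.out.println(shouldLaunchJob(List.of("logs"), Set.of("logs"), false, 0));
    }
}
```

In this sketch an empty directory that is missing at the destination still triggers a job, so the behaviour introduced by HADOOP-5762 is preserved.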
2. distcp -update does not skip copying a single file when the destination file already exists.
When we run
hadoop distcp -update srcfilename destfilename
it seems to compare the checksums of srcfilename and destfilename/srcfilename, so the copy is not skipped. It should compare the checksums of srcfilename and destfilename.
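The expected choice of comparison target can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, java.nio.file.Path stands in for Hadoop's own Path type, and the two boolean flags replace the file status lookups that real code would perform. The rule that an existing destination file is compared directly is one reading of the behaviour requested above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical model of picking the file to compare against; not taken from DistCp. */
public class UpdateTargetSketch {

    /**
     * Returns the destination path whose checksum should be compared
     * with the source file's checksum under -update.
     */
    static Path comparisonTarget(Path src, boolean srcIsFile,
                                 Path dest, boolean destIsExistingFile) {
        if (srcIsFile && destIsExistingFile) {
            // Single file onto an existing file: compare with dest itself.
            return dest;
        }
        // Otherwise the source lands under dest, so compare with dest/<source name>.
        return dest.resolve(src.getFileName());
    }

    public static void main(String[] args) {
        Path src = Paths.get("/data/srcfilename");
        // Destination is an existing file: compared directly.
        System.out.println(comparisonTarget(src, true, Paths.get("/backup/destfilename"), true));
        // Destination is a directory: compared with the file inside it.
        System.out.println(comparisonTarget(src, true, Paths.get("/backup"), false));
    }
}
```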
MAPREDUCE-644 distcp does not skip copying file if we are updating single file
- relates to
HADOOP-5762 distcp does not copy empty directories