Details
- Improvement
- Status: Open
- Major
- Resolution: Unresolved
- 3.3.0, 3.2.1
- None
- None
- Patch
Description
We use distcp with the -direct option to copy a single file between two large directories, and found that it takes a few minutes. If we launch too many distcp jobs at the same time, NameNode performance degrades seriously.
hadoop distcp -direct -skipcrccheck -update -prbugaxt -i -numListstatusThreads 1 hdfs://cluster1:8020/source/100.log hdfs://cluster2:8020/target/100.jpg
Directory | Path | Count |
---|---|---|
Source dir | hdfs://cluster1:8020/source/ | 100k+ files |
Target dir | hdfs://cluster2:8020/target/ | 100k+ files |
Checking the code in CopyCommitter.java, we found that the function deleteAttemptTempFiles() contains the call
targetFS.globStatus(new Path(targetWorkPath, ".distcp.tmp." + jobId.replaceAll("job","attempt") + "*"));
This glob wastes a lot of time when distcp runs between two large directories, because it has to match the pattern against every entry in the target directory. When distcp is used with the -direct option, it writes directly to the target file without generating a '.distcp.tmp' temp file, so there is nothing for the glob to find. I therefore think a check should be added before deleteAttemptTempFiles() is called: if distcp is running with the -direct option, do nothing and return immediately.
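The proposed check could look roughly like the sketch below. It is not the actual CopyCommitter code: the class name DirectCommitSketch, the method findAttemptTempFiles, and the directWrite flag are hypothetical, and the file system glob is modelled as a plain function instead of Hadoop's FileSystem API so that the example runs standalone. Only the glob pattern is taken from the code quoted above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the commit-time cleanup; not the real CopyCommitter.
class DirectCommitSketch {

    // Same pattern as the globStatus() call quoted in the description.
    static String tempFileGlob(String jobId) {
        return ".distcp.tmp." + jobId.replaceAll("job", "attempt") + "*";
    }

    // directWrite is an assumed flag meaning "the job was started with -direct".
    // globber stands in for the expensive listing of the target directory.
    static List<String> findAttemptTempFiles(boolean directWrite,
                                             String jobId,
                                             Function<String, List<String>> globber) {
        if (directWrite) {
            // With -direct no '.distcp.tmp' files are written, so skip the glob.
            return Collections.emptyList();
        }
        return globber.apply(tempFileGlob(jobId));
    }

    public static void main(String[] args) {
        // Fake globber that reports each time the costly listing would happen.
        Function<String, List<String>> globber = pattern -> {
            System.out.println("globbing target dir with " + pattern);
            return new ArrayList<>();
        };
        findAttemptTempFiles(true, "job_1582_0001", globber);   // glob skipped
        findAttemptTempFiles(false, "job_1582_0001", globber);  // glob executed
    }
}
```

In this sketch the early return means the target directory is never listed for -direct jobs, while jobs without -direct keep the existing cleanup behaviour unchanged.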