- Type: New Feature
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.6.5
- Fix Version/s: 2.6.6
- Component/s: tools/distcp
- Labels: None
DistCp is very slow to start when the src directory has a huge number of subdirectories. In our case, we already have the directory listing (via "hdfs oiv -i fsimage" or via nightly "hdfs dfs -ls -R /" dumps), and we would like to use that instead of performing a real-time listing against the NameNode.
The "-f" option doesn't help in this case because it would try to put everything into a single flat target directory.
We'd like to introduce a new option "-list <file>" for distcp, where <file> contains the result of listing the src directory.
In order to achieve this, we plan to:
1. Add a new CopyListing class, PregeneratedCopyListing, similar to SimpleCopyListing. Instead of recursively listing the src directory (the equivalent of "-ls -R"), it reads the listing from the file passed via "-list".
2. Add an option "-list <file>" that automatically makes distcp use the new PregeneratedCopyListing class.
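
To illustrate the idea behind step 1, here is a minimal, standalone sketch of how a pregenerated listing could be turned into copy entries while preserving the directory structure under the src root. It is not the code from the patch: the class name PregeneratedListingSketch, the Entry type, and both methods are hypothetical, and it uses no Hadoop classes. It assumes each line of <file> has the eight whitespace-separated columns printed by "hdfs dfs -ls -R" (permissions, replication, owner, group, size, date, time, path); an "hdfs oiv" dump would need a different parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not the PregeneratedCopyListing class from the patch. */
public class PregeneratedListingSketch {

    /** One listing entry: absolute path, path relative to the src root, and type. */
    public static final class Entry {
        public final String path;
        public final String relativePath;
        public final boolean isDirectory;

        Entry(String path, String relativePath, boolean isDirectory) {
            this.path = path;
            this.relativePath = relativePath;
            this.isDirectory = isDirectory;
        }
    }

    /**
     * Parses one line assumed to be in "hdfs dfs -ls -R" format.
     * Returns null for lines that do not match that format or that
     * lie outside sourceRoot.
     */
    public static Entry parseLine(String line, String sourceRoot) {
        // Limit of 8 keeps a path containing spaces intact in the last field.
        String[] fields = line.trim().split("\\s+", 8);
        if (fields.length < 8 || !fields[7].startsWith("/")) {
            return null;
        }
        String path = fields[7];

        // Normalize the root so "/data/src/" and "/data/src" behave the same.
        String root = sourceRoot;
        while (root.endsWith("/")) {
            root = root.substring(0, root.length() - 1);
        }

        String relative;
        if (path.equals(root)) {
            relative = "";
        } else if (path.startsWith(root + "/")) {
            relative = path.substring(root.length() + 1);
        } else {
            return null; // entry is not under the src root
        }
        return new Entry(path, relative, fields[0].startsWith("d"));
    }

    /** Parses every line of the listing, skipping lines that do not apply. */
    public static List<Entry> parseListing(List<String> lines, String sourceRoot) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            Entry entry = parseLine(line, sourceRoot);
            if (entry != null) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> listing = List.of(
            "drwxr-xr-x   - alice hadoop          0 2015-06-01 12:00 /data/src/a",
            "-rw-r--r--   3 alice hadoop       1234 2015-06-01 12:00 /data/src/a/b.txt");
        for (Entry e : parseListing(listing, "/data/src")) {
            System.out.println((e.isDirectory ? "dir  " : "file ") + e.relativePath);
        }
    }
}
```

Keeping the path relative to the src root is what distinguishes this from "-f": each entry can be recreated at the same relative location under the target instead of being flattened into a single directory.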