The hdfs dfs -ls -R command runs sequentially and is very slow on an HCFS (Hadoop Compatible File System). We have seen it take around 6 minutes to list a structure of about 40K directories and files.
The proposal is to use a multithreaded approach to speed up the recursive list, du, and count operations.
We have tried a ForkJoinPool implementation to improve the performance of the recursive listing operation.
commit id :
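For illustration only, the ForkJoinPool idea can be sketched roughly as below. This is not the code from the commit: the class name ParallelLister and the method listRecursively are hypothetical, and java.nio.file on a local directory tree stands in for the Hadoop FileSystem API so the sketch stays self-contained. Each directory becomes a task that forks one subtask per subdirectory and then joins their results.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: java.nio.file is an assumed stand-in for a Hadoop
// FileSystem; the names here are illustrative, not taken from the patch.
class ParallelLister extends RecursiveTask<List<Path>> {
    private final Path dir;

    ParallelLister(Path dir) {
        this.dir = dir;
    }

    @Override
    protected List<Path> compute() {
        List<Path> found = new ArrayList<>();
        List<ParallelLister> subtasks = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                found.add(entry);
                // Fork a subtask per subdirectory so siblings are listed in parallel.
                if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    ParallelLister sub = new ParallelLister(entry);
                    sub.fork();
                    subtasks.add(sub);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Join after the directory stream is closed, then merge child results.
        for (ParallelLister sub : subtasks) {
            found.addAll(sub.join());
        }
        return found;
    }

    // Lists every entry under root (root itself excluded) using the given parallelism.
    static List<Path> listRecursively(Path root, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ParallelLister(root));
        } finally {
            pool.shutdown();
        }
    }
}
```

The order of the returned entries depends on how the subtasks are merged, so a real ls -R would still need to sort or group the output per directory before printing.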
Another implementation uses the Java ExecutorService to run the listing operation on multiple threads in parallel. This reduced the listing time significantly, from around 6 minutes to about 40 seconds (roughly a 9x speedup).
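A minimal sketch of the ExecutorService variant is shown below, again under assumptions: the class name ExecutorLister and its methods are hypothetical, and java.nio.file on a local tree stands in for the Hadoop FileSystem API. The tree is walked level by level; all directories of one level are listed concurrently on a fixed thread pool, and their subdirectories form the next level.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: java.nio.file is an assumed stand-in for a Hadoop
// FileSystem; the names here are illustrative, not taken from the patch.
class ExecutorLister {
    // Lists the direct children of one directory; runs on a worker thread.
    private static List<Path> listOne(Path dir) throws IOException {
        List<Path> children = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                children.add(entry);
            }
        }
        return children;
    }

    // Lists every entry under root (root itself excluded) with a fixed pool.
    static List<Path> listRecursively(Path root, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Path> found = new ArrayList<>();
        try {
            List<Path> level = Collections.singletonList(root);
            while (!level.isEmpty()) {
                // One listing task per directory in the current level.
                List<Callable<List<Path>>> tasks = new ArrayList<>();
                for (Path dir : level) {
                    tasks.add(() -> listOne(dir));
                }
                List<Path> next = new ArrayList<>();
                // invokeAll blocks until every task of this level has finished.
                for (Future<List<Path>> result : pool.invokeAll(tasks)) {
                    for (Path child : result.get()) {
                        found.add(child);
                        if (Files.isDirectory(child, LinkOption.NOFOLLOW_LINKS)) {
                            next.add(child);
                        }
                    }
                }
                level = next;
            }
        } finally {
            pool.shutdown();
        }
        return found;
    }
}
```

The level-by-level barrier keeps the sketch simple but leaves threads idle when one directory of a level is much larger than the others; a work queue with a pending-task counter would avoid that at the cost of more bookkeeping.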