Details
Bug
Status: Resolved
Major
Resolution: Fixed
Impala 3.1.0
None
Description
Dataload has several places where it runs a long sequence of HDFS commands, one operation per invocation, similar to this:
hdfs dfs -mkdir bad_table1
hdfs dfs -put bad_file_1 bad_table1
hdfs dfs -put bad_file_2 bad_table1
hdfs dfs -mkdir bad_table2
hdfs dfs -put bad_file_3 bad_table2
hdfs dfs -put bad_file_4 bad_table2
Most hdfs shell commands accept multiple arguments. In particular, "mkdir" can create multiple directories in one command, and "put" can copy multiple files into a single destination. Batching this way reduces the number of hdfs command-line invocations, each of which is expensive because of JVM startup and other costs. For example, the commands above are equivalent to:
hdfs dfs -mkdir bad_table1 bad_table2
hdfs dfs -put bad_file_1 bad_file_2 bad_table1
hdfs dfs -put bad_file_3 bad_file_4 bad_table2
Dataload should make these types of optimizations wherever possible.
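As a rough illustration of the batching described above, here is a minimal Python sketch that groups uploads by destination directory and emits one "mkdir" for all directories plus one "put" per directory. The function name batch_hdfs_commands and the (local_file, hdfs_dir) input shape are hypothetical, chosen only for this example; they are not taken from the actual dataload code. The sketch only builds the command lines and does not run them.

```python
def batch_hdfs_commands(uploads):
    """Turn (local_file, hdfs_dir) pairs into batched hdfs command lines.

    Hypothetical helper for illustration only. Returns a list of argument
    lists: one "mkdir" covering every destination directory, followed by
    one "put" per destination directory.
    """
    # Group files by destination, preserving first-seen order of directories
    # (dicts keep insertion order in Python 3.7+).
    files_by_dir = {}
    for local_file, hdfs_dir in uploads:
        files_by_dir.setdefault(hdfs_dir, []).append(local_file)

    if not files_by_dir:
        return []

    # A single mkdir invocation creates all destination directories.
    commands = [["hdfs", "dfs", "-mkdir"] + list(files_by_dir)]

    # A single put invocation per directory copies all of its files.
    for hdfs_dir, files in files_by_dir.items():
        commands.append(["hdfs", "dfs", "-put"] + files + [hdfs_dir])
    return commands


uploads = [
    ("bad_file_1", "bad_table1"),
    ("bad_file_2", "bad_table1"),
    ("bad_file_3", "bad_table2"),
    ("bad_file_4", "bad_table2"),
]
for command in batch_hdfs_commands(uploads):
    print(" ".join(command))
```

For the six-command example in this description, the sketch produces three invocations instead of six. Each argument list could be passed to something like subprocess.check_call, though how dataload actually launches these commands is not specified here.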