IMPALA-6899

Dataload uses excessive HDFS commands

      Description

      Dataload has several places where it runs a long series of HDFS commands similar to this:

      hdfs dfs -mkdir bad_table1
      hdfs dfs -put bad_file_1 bad_table1
      hdfs dfs -put bad_file_2 bad_table1
      hdfs dfs -mkdir bad_table2
      hdfs dfs -put bad_file_3 bad_table2
      hdfs dfs -put bad_file_4 bad_table2

      Most hdfs shell commands accept multiple arguments. In particular, "mkdir" can create several directories in one command, and "put" can copy several files into a single destination. Batching arguments this way reduces the number of hdfs command-line invocations, each of which is expensive because of JVM startup and other costs. For example, the commands above are equivalent to:

      hdfs dfs -mkdir bad_table1 bad_table2
      hdfs dfs -put bad_file_1 bad_file_2 bad_table1
      hdfs dfs -put bad_file_3 bad_file_4 bad_table2

      Dataload should make these types of optimizations wherever possible.
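
      As a rough illustration of the batching described above, here is a minimal Python sketch. It is not taken from the dataload scripts: the function name batch_hdfs_commands and its input format (a list of (local_file, hdfs_dir) pairs) are assumptions made for this example. It only builds the argument lists and does not run hdfs itself.

```python
from collections import OrderedDict


def batch_hdfs_commands(uploads):
    """Build batched hdfs commands from (local_file, hdfs_dir) pairs.

    Hypothetical helper for illustration only. Returns a list of argv
    lists: one "mkdir" covering every destination directory, then one
    "put" per destination directory.
    """
    # Group local files by destination, preserving first-seen order.
    files_by_dir = OrderedDict()
    for local_file, hdfs_dir in uploads:
        files_by_dir.setdefault(hdfs_dir, []).append(local_file)

    if not files_by_dir:
        return []

    # A single mkdir invocation creates all destination directories.
    commands = [["hdfs", "dfs", "-mkdir"] + list(files_by_dir)]

    # A single put invocation per directory copies all of its files.
    for hdfs_dir, local_files in files_by_dir.items():
        commands.append(["hdfs", "dfs", "-put"] + local_files + [hdfs_dir])
    return commands


if __name__ == "__main__":
    example = [
        ("bad_file_1", "bad_table1"),
        ("bad_file_2", "bad_table1"),
        ("bad_file_3", "bad_table2"),
        ("bad_file_4", "bad_table2"),
    ]
    for argv in batch_hdfs_commands(example):
        print(" ".join(argv))
```

      For the six-command example above, this produces the three batched commands. Each argument list could then be passed to something like subprocess.check_call, so the JVM startup cost is paid three times instead of six.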

       

       

            People

            • Assignee:
              joemcdonnell Joe McDonnell
              Reporter:
              joemcdonnell Joe McDonnell