IMPALA-6899

Dataload uses excessive HDFS commands

Details

    Description

      Dataload has several places where it runs a long sequence of HDFS commands similar to this:

      hdfs dfs -mkdir bad_table1
      hdfs dfs -put bad_file_1 bad_table1
      hdfs dfs -put bad_file_2 bad_table1
      hdfs dfs -mkdir bad_table2
      hdfs dfs -put bad_file_3 bad_table2
      hdfs dfs -put bad_file_4 bad_table2

      Most hdfs shell commands accept multiple arguments. In particular, "mkdir" can create several directories in one command, and "put" can copy several files into a single destination. Batching this way reduces the number of hdfs command-line invocations, each of which is expensive because of JVM startup and other costs. For example, the commands above are equivalent to:

      hdfs dfs -mkdir bad_table1 bad_table2
      hdfs dfs -put bad_file_1 bad_file_2 bad_table1
      hdfs dfs -put bad_file_3 bad_file_4 bad_table2

      Dataload should make these types of optimizations wherever possible.
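
One way to apply this kind of batching mechanically is sketched below. This is only an illustration, not code from Dataload: the function name `batch_hdfs_commands` and its input format (one "local_file destination_dir" pair per line on stdin) are assumptions made for the example. It prints the batched commands instead of running them, so the result can be inspected first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Impala's dataload scripts).
# Reads "local_file destination_dir" pairs from stdin, one pair per line,
# and prints one "mkdir" covering every destination plus one "put" per
# destination, rather than one command per file.
# Assumes file and directory names contain no whitespace.
batch_hdfs_commands() {
  awk '
    NF == 2 {
      src = $1; dest = $2
      if (!(dest in files)) {
        # First time this destination is seen: remember its order.
        order[++n] = dest
        files[dest] = src
      } else {
        files[dest] = files[dest] " " src
      }
    }
    END {
      if (n == 0) exit
      # A single mkdir for all destination directories.
      line = "hdfs dfs -mkdir"
      for (i = 1; i <= n; i++) line = line " " order[i]
      print line
      # A single put per destination, listing all of its source files.
      for (i = 1; i <= n; i++)
        print "hdfs dfs -put " files[order[i]] " " order[i]
    }
  '
}
```

Fed the four file/table pairs from the example above, this prints the three batched commands shown in the description. The output is plain text, so it is only a dry run; actually executing the commands is left to the caller.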

          People

            joemcdonnell Joe McDonnell