  1. Hive
  2. HIVE-14269 Performance optimizations for data on S3
  3. HIVE-15546

Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel


    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: Hive
    • Labels:


      On blobstores such as S3, where metadata operations such as listStatus are costly, Utilities.getInputPaths() can add significant overhead when setting up the input paths for an MR / Spark / Tez job.

      The method performs a listStatus on each input path to check whether that path is empty. If it is, a dummy file is created for the corresponding partition. All of this is done sequentially, which can be very slow when there are many empty partitions. Even when every partition has input data, the listStatus calls alone can take a long time.

      We should either:

      (1) Just remove the logic to check if each input path is empty, and handle any edge cases accordingly.

      (2) Multi-thread the listStatus calls, so that each partition's listing runs in parallel.
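
      Option (2) could be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches: it uses java.nio.file on a local filesystem as a stand-in for Hadoop's FileSystem.listStatus, and the class name ParallelEmptyCheck, the helper isEmpty, and the numThreads parameter are all hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

// Hypothetical sketch: check many input directories for emptiness in parallel.
public class ParallelEmptyCheck {

    // Stand-in for a listStatus call (assumption: local filesystem, not S3).
    // Returns true if the directory contains no entries.
    static boolean isEmpty(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return !entries.findAny().isPresent();
        }
    }

    // Submits one listing task per input path to a fixed-size pool and
    // collects the results in the same order as the input list.
    static Map<Path, Boolean> findEmpty(List<Path> inputPaths, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (Path p : inputPaths) {
                tasks.add(() -> isEmpty(p));
            }
            // invokeAll blocks until every task finishes and returns the
            // futures in the same sequence as the submitted tasks.
            List<Future<Boolean>> futures = pool.invokeAll(tasks);
            Map<Path, Boolean> result = new LinkedHashMap<>();
            for (int i = 0; i < inputPaths.size(); i++) {
                result.put(inputPaths.get(i), futures.get(i).get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

      A caller would pass the list of partition directories and create a dummy file only for those mapped to true. With a high-latency store the total wait is then bounded by the slowest batch of listings rather than by the sum of all of them; the pool size is a tuning choice and is not specified in this issue.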


        1. HIVE-15546.6.patch
          9 kB
          Sahil Takiar
        2. HIVE-15546.5.patch
          9 kB
          Sahil Takiar
        3. HIVE-15546.4.patch
          9 kB
          Sahil Takiar
        4. HIVE-15546.3.patch
          6 kB
          Sahil Takiar
        5. HIVE-15546.2.patch
          6 kB
          Sahil Takiar
        6. HIVE-15546.1.patch
          0.7 kB
          Sahil Takiar




              • Assignee:
                stakiar Sahil Takiar
              • Votes:
                0
              • Watchers:
                9


                • Created: