Hive / HIVE-14269 Performance optimizations for data on S3 / HIVE-15546

Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: Hive
    • Labels: None

    Description

      When running on blobstores (like S3) where metadata operations (like listStatus) are costly, Utilities.getInputPaths() can add significant overhead when setting up the input paths for an MR / Spark / Tez job.

      The method calls listStatus on every input path to check whether the path is empty. If it is, a dummy file is created for that partition. All of this happens sequentially, so it can be very slow when there are many empty partitions, and it can take a long time even when every partition has input data.

      We should either:

      (1) Just remove the logic to check if each input path is empty, and handle any edge cases accordingly.

      (2) Multi-thread the listStatus calls.
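
      Option (2) could look roughly like the Java sketch below. It is not the code from the attached patches. ParallelEmptyCheck, findEmptyPaths and the lister function are hypothetical names, and lister stands in for a FileSystem.listStatus call so that the example runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of Hive: checks input paths for emptiness in parallel.
final class ParallelEmptyCheck {

    /**
     * Returns, for each input path, whether its listing came back empty.
     *
     * @param paths      input paths to check (duplicates collapse to one entry)
     * @param lister     assumed stand-in for a listStatus call: path -> entries under it
     * @param numThreads size of the thread pool, must be positive
     */
    public static Map<String, Boolean> findEmptyPaths(
            List<String> paths,
            Function<String, List<String>> lister,
            int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Submit one task per path so the slow metadata calls overlap.
            Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
            for (String path : paths) {
                Callable<Boolean> task = () -> lister.apply(path).isEmpty();
                pending.put(path, pool.submit(task));
            }

            // Collect results in the original path order.
            Map<String, Boolean> isEmpty = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Boolean>> entry : pending.entrySet()) {
                isEmpty.put(entry.getKey(), entry.getValue().get());
            }
            return isEmpty;
        } catch (InterruptedException e) {
            // Restore the interrupt flag before giving up.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while listing input paths", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Listing an input path failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

      A caller would then create the dummy file only for the paths reported as empty. The pool size is a tuning choice: more threads hide more blobstore latency but also put more concurrent requests on the store.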

      Attachments

        1. HIVE-15546.1.patch (0.7 kB, Sahil Takiar)
        2. HIVE-15546.2.patch (6 kB, Sahil Takiar)
        3. HIVE-15546.3.patch (6 kB, Sahil Takiar)
        4. HIVE-15546.4.patch (9 kB, Sahil Takiar)
        5. HIVE-15546.5.patch (9 kB, Sahil Takiar)
        6. HIVE-15546.6.patch (9 kB, Sahil Takiar)

            People

              Assignee: Sahil Takiar (stakiar)
              Reporter: Sahil Takiar (stakiar)
              Votes: 0
              Watchers: 9