  1. Hive
  2. HIVE-14269 Performance optimizations for data on S3
  3. HIVE-15546

Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3.0
    • Component/s: Hive
    • Labels: None

    Description

      When running on blobstores (like S3) where metadata operations (like listStatus) are costly, Utilities.getInputPaths() can add significant overhead when setting up the input paths for an MR / Spark / Tez job.

      The method performs a listStatus on every input path to check whether the path is empty. If it is, a dummy file is created for that partition. All of this is done sequentially, so it can be very slow when there are many empty partitions. Even when every partition has input data, it can still take a long time.

      We should either:

      (1) Remove the logic that checks whether each input path is empty, and handle any resulting edge cases accordingly.

      (2) Multi-thread the listStatus calls.
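
Option (2) can be sketched roughly as follows. This is a minimal illustration, not the code from the attached patches: the class name `ParallelListingSketch`, the method `findEmptyPaths`, and the `lister` parameter are all hypothetical. The `lister` function stands in for a `FileSystem.listStatus` call so that the sketch runs without Hadoop on the classpath, and the pool size is assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: not part of Hive's Utilities class.
class ParallelListingSketch {

    /**
     * Lists every path on a bounded thread pool and reports which paths are empty.
     * The lister is an assumed stand-in for a per-partition listStatus call.
     */
    static Map<String, Boolean> findEmptyPaths(
            List<String> paths,
            Function<String, List<String>> lister,
            int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, numThreads));
        try {
            // Submit all listings up front so the slow metadata calls overlap.
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<Boolean> task = () -> lister.apply(path).isEmpty();
                futures.add(pool.submit(task));
            }
            // Collect the results in the original input order.
            Map<String, Boolean> isEmpty = new LinkedHashMap<>();
            for (int i = 0; i < paths.size(); i++) {
                try {
                    isEmpty.put(paths.get(i), futures.get(i).get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while listing " + paths.get(i), e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("Listing failed for " + paths.get(i), e.getCause());
                }
            }
            return isEmpty;
        } finally {
            // Cancel anything still running if a listing failed part-way through.
            pool.shutdownNow();
        }
    }
}
```

With a sketch like this, the caller would create the dummy file only for paths reported as empty. The elapsed time then depends on the slowest batch of listings rather than on the sum of all of them.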

      Attachments

        1. HIVE-15546.1.patch
          0.7 kB
          Sahil Takiar
        2. HIVE-15546.2.patch
          6 kB
          Sahil Takiar
        3. HIVE-15546.3.patch
          6 kB
          Sahil Takiar
        4. HIVE-15546.4.patch
          9 kB
          Sahil Takiar
        5. HIVE-15546.5.patch
          9 kB
          Sahil Takiar
        6. HIVE-15546.6.patch
          9 kB
          Sahil Takiar


          People

            Assignee: Sahil Takiar (stakiar)
            Reporter: Sahil Takiar (stakiar)
            Votes: 0
            Watchers: 9
