  Hive / HIVE-14269

Performance optimizations for data on S3


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Working with tables that reside on Amazon S3 (or any other object store) incurs several performance penalties when reading or writing data, and also raises consistency issues.

      This JIRA is an umbrella task to track all the performance improvements that can be made in Hive to work better with data on S3.
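
      For context on the listing side of that cost, below is a minimal sketch of the single recursive listing call that several of the linked sub-tasks move toward (fs.listFiles with recursive=true). It assumes only the Hadoop FileSystem API; the bucket and path are hypothetical.

      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.LocatedFileStatus;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.RemoteIterator;

      public class S3ListingSketch {
        public static void main(String[] args) throws IOException {
          // Hypothetical table location on S3; replace with a real bucket and path.
          Path tableDir = new Path("s3a://my-bucket/warehouse/my_table");
          FileSystem fs = tableDir.getFileSystem(new Configuration());

          // One recursive listing call; on s3a this is served by paged LIST requests
          // rather than one round trip per sub-directory as a manual tree walk would issue.
          RemoteIterator<LocatedFileStatus> files = fs.listFiles(tableDir, true);
          long count = 0;
          long bytes = 0;
          while (files.hasNext()) {
            LocatedFileStatus status = files.next();
            count++;
            bytes += status.getLen();
          }
          System.out.println(count + " files, " + bytes + " bytes under " + tableDir);
        }
      }

      Each directory probe against an object store is a separate HTTP round trip, so a single paged listing is generally much cheaper than walking the tree directory by directory; that is the motivation behind the listing-related sub-tasks below.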

        Attachments

        Issue Links

        1.
        Write temporary data to HDFS when doing inserts on tables located on S3 Sub-task Resolved Sergio Peña   Actions
        2.
        Skip 'distcp' call when copying data from HDFS to S3 Sub-task Patch Available Sergio Peña   Actions
        3.
        FileSinkOperator should not rename files to final paths when S3 is the default destination Sub-task Reopened Sergio Peña   Actions
        4.
        ConditionalResolverMergeFiles should keep staging data on HDFS, then copy (no rename) to S3 Sub-task Resolved Sergio Peña   Actions
        5.
        Remove Hive file listing during split computation Sub-task Resolved Peter Varga   Actions
        6.
        Change .hive-staging directory created on S3 to HDFS when using INSERT OVERWRITE Sub-task Resolved Unassigned   Actions
        7.
        Add integration tests for hive on S3 Sub-task Resolved Thomas Poepping   Actions
        8.
        S3-to-S3 Renames: Files should be moved individually rather than at a directory level Sub-task Resolved Sahil Takiar   Actions
        9.
        Remove extra MoveTask operators from the ConditionalTask Sub-task Resolved Sergio Peña   Actions
        10.
        Last MR job in Hive should be able to write to a different scratch directory Sub-task Reopened Sahil Takiar   Actions
        11.
        Investigate if staging data on S3 can always go under the scratch dir Sub-task Open Unassigned   Actions
        12.
        Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel Sub-task Resolved Sahil Takiar   Actions
        13.
        Blobstores should use fs.listFiles(path, recursive=true) rather than FileUtils.listStatusRecursively Sub-task Resolved Unassigned   Actions
        14.
        Fix NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V Sub-task Closed Jesus Camacho Rodriguez   Actions
        15.
        Add support for using Hadoop's S3A OutputCommitter Sub-task Patch Available Unassigned   Actions
        16.
        inheritPerms should be conditional based on the target filesystem Sub-task Resolved Sahil Takiar   Actions
        17.
        druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x Sub-task Closed Steve Loughran   Actions
        18.
        Dynamic Partitioning Integration with Hadoop's S3A OutputCommitter Sub-task Open Unassigned   Actions
        19.
        Ability to selectively run tests in TestBlobstoreCliDriver Sub-task Open Sahil Takiar   Actions
        20.
        PerfLogger integration for critical Hive-on-S3 paths Sub-task Resolved Sahil Takiar   Actions
        21.
        Add --SORT_QUERY_RESULTS to hive-blobstore/map_join.q.out Sub-task Resolved Sahil Takiar   Actions
        22.
        Add Parquet specific tests to BlobstoreCliDriver Sub-task Resolved Sahil Takiar   Actions
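
        Several of the sub-tasks above (the S3-to-S3 rename and staging-directory items) replace directory-level renames with per-file moves, since on s3a every "rename" is a server-side copy plus a delete. A minimal sketch of that pattern, assuming only the Hadoop FileSystem API; the staging and destination paths are hypothetical and only the files directly under the source directory are moved here.

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.LocatedFileStatus;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.fs.RemoteIterator;

        public class PerFileMoveSketch {

          // Moves the files sitting directly under srcDir into dstDir one at a time,
          // so a failure can be retried per file instead of redoing a whole
          // directory-level move.
          static void moveFilesIndividually(FileSystem fs, Path srcDir, Path dstDir)
              throws IOException {
            RemoteIterator<LocatedFileStatus> files = fs.listFiles(srcDir, false);
            while (files.hasNext()) {
              Path src = files.next().getPath();
              Path dst = new Path(dstDir, src.getName());
              if (!fs.rename(src, dst)) {
                throw new IOException("Could not move " + src + " to " + dst);
              }
            }
          }

          public static void main(String[] args) throws IOException {
            // Hypothetical staging and final locations; replace with real paths.
            Path staging = new Path("s3a://my-bucket/warehouse/my_table/.hive-staging_tmp");
            Path finalDir = new Path("s3a://my-bucket/warehouse/my_table");
            FileSystem fs = staging.getFileSystem(new Configuration());
            moveFilesIndividually(fs, staging, finalDir);
          }
        }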

          Activity


            People

            • Assignee:
              spena Sergio Peña
            • Reporter:
              spena Sergio Peña

              Dates

              • Created:
                Updated:

              Time Tracking

              Estimated:
              Not Specified
              Remaining:
              0h
              Logged:
              40m
