Hive / HIVE-14269

Performance optimizations for data on S3

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Working with tables that reside on Amazon S3 (or any other object store) has several performance impacts when reading or writing data, and also raises consistency issues.

      This JIRA is an umbrella task to track all the performance improvements that can be made in Hive to work better with data on S3.

      Attachments

        Issue Links

          1. Write temporary data to HDFS when doing inserts on tables located on S3 | Sub-task | Resolved | Sergio Peña
          2. Skip 'distcp' call when copying data from HDFS to S3 | Sub-task | Patch Available | Sergio Peña
          3. FileSinkOperator should not rename files to final paths when S3 is the default destination | Sub-task | Reopened | Sergio Peña
          4. ConditionalResolverMergeFiles should keep staging data on HDFS, then copy (no rename) to S3 | Sub-task | Resolved | Sergio Peña
          5. Remove Hive file listing during split computation | Sub-task | Closed | Peter Varga
          6. Change .hive-staging directory created on S3 to HDFS when using INSERT OVERWRITE | Sub-task | Resolved | Unassigned
          7. Add integration tests for hive on S3 | Sub-task | Resolved | Thomas Poepping
          8. S3-to-S3 Renames: Files should be moved individually rather than at a directory level | Sub-task | Resolved | Sahil Takiar
          9. Remove extra MoveTask operators from the ConditionalTask | Sub-task | Resolved | Sergio Peña
          10. Last MR job in Hive should be able to write to a different scratch directory | Sub-task | Reopened | Sahil Takiar
          11. Investigate if staging data on S3 can always go under the scratch dir | Sub-task | Open | Unassigned
          12. Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel | Sub-task | Resolved | Sahil Takiar
          13. Blobstores should use fs.listFiles(path, recursive=true) rather than FileUtils.listStatusRecursively | Sub-task | Resolved | Unassigned
          14. Fix NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V | Sub-task | Closed | Jesús Camacho Rodríguez
          15. Add support for using Hadoop's S3A OutputCommitter | Sub-task | Patch Available | Unassigned
          16. inheritPerms should be conditional based on the target filesystem | Sub-task | Resolved | Sahil Takiar
          17. druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x | Sub-task | Closed | Steve Loughran
          18. Dynamic Partitioning Integration with Hadoop's S3A OutputCommitter | Sub-task | Open | Unassigned
          19. Ability to selectively run tests in TestBlobstoreCliDriver | Sub-task | Open | Sahil Takiar
          20. PerfLogger integration for critical Hive-on-S3 paths | Sub-task | Closed | Sahil Takiar
          21. Add --SORT_QUERY_RESULTS to hive-blobstore/map_join.q.out | Sub-task | Closed | Sahil Takiar
          22. Add Parquet specific tests to BlobstoreCliDriver | Sub-task | Closed | Sahil Takiar
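
          As a rough illustration of the idea named in sub-task 12 (listing each partition in parallel rather than one after another), the sketch below submits one listing task per partition directory to a thread pool. It is not the Hive patch: it uses java.nio.file on local directories as a stand-in for a Hadoop FileSystem, and the class and method names (ParallelPartitionListing, listPartition, listAll) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in: local directories play the role of table partitions.
public class ParallelPartitionListing {

    // Lists the direct children of one partition directory (one "listStatus"-style call).
    static List<Path> listPartition(Path partition) {
        try (Stream<Path> children = Files.list(partition)) {
            return children.sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Submits one listing task per partition, then collects results in input order.
    static Map<Path, List<Path>> listAll(List<Path> partitions, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Path>>> futures = new ArrayList<>();
            for (Path p : partitions) {
                Callable<List<Path>> task = () -> listPartition(p);
                futures.add(pool.submit(task));
            }
            Map<Path, List<Path>> result = new LinkedHashMap<>();
            for (int i = 0; i < partitions.size(); i++) {
                // get() blocks until that partition's listing has finished.
                result.put(partitions.get(i), futures.get(i).get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a small fake table with four partitions, one file each.
        Path table = Files.createTempDirectory("table");
        List<Path> partitions = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            Path part = Files.createDirectory(table.resolve("ds=" + i));
            Files.createFile(part.resolve("000000_0"));
            partitions.add(part);
        }
        listAll(partitions, 4).forEach((part, files) ->
            System.out.println(part.getFileName() + " -> " + files.size() + " file(s)"));
    }
}
```

          On an object store each listing is a remote call, so overlapping the calls can hide per-request latency. The pool size here is an arbitrary choice for the sketch.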

          Activity

            People

              Assignee: Sergio Peña
              Reporter: Sergio Peña
              Votes: 0
              Watchers: 35

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m