Hive / HIVE-14269

Performance optimizations for data on S3

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Working with tables that reside on Amazon S3 (or any other object store) has several performance impacts when reading or writing data, as well as consistency issues.

      This JIRA is an umbrella task to track all the performance improvements that can be made in Hive so that it works better with data on S3.

          Issue Links

          1. Write temporary data to HDFS when doing inserts on tables located on S3 (Sub-task, Resolved, Sergio Peña)
          2. Skip 'distcp' call when copying data from HDFS to S3 (Sub-task, Patch Available, Sergio Peña)
          3. FileSinkOperator should not rename files to final paths when S3 is the default destination (Sub-task, Reopened, Sergio Peña)
          4. ConditionalResolverMergeFiles should keep staging data on HDFS, then copy (no rename) to S3 (Sub-task, Resolved, Sergio Peña)
          5. Remove Hive file listing during split computation (Sub-task, Open, Sahil Takiar)
          6. Change .hive-staging directory created on S3 to HDFS when using INSERT OVERWRITE (Sub-task, Resolved, Unassigned)
          7. Add integration tests for Hive on S3 (Sub-task, Resolved, Thomas Poepping)
          8. S3-to-S3 Renames: Files should be moved individually rather than at a directory level (Sub-task, Resolved, Sahil Takiar)
          9. Remove extra MoveTask operators from the ConditionalTask (Sub-task, Resolved, Sergio Peña)
          10. Last MR job in Hive should be able to write to a different scratch directory (Sub-task, Reopened, Sahil Takiar)
          11. Investigate if staging data on S3 can always go under the scratch dir (Sub-task, Open, Unassigned)
          12. Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel (Sub-task, Resolved, Sahil Takiar)
          13. Blobstores should use fs.listFiles(path, recursive=true) rather than FileUtils.listStatusRecursively (Sub-task, Open, Unassigned)
          14. Fix NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V (Sub-task, Closed, Jesus Camacho Rodriguez)
          15. Add support for using Hadoop's S3A OutputCommitter (Sub-task, Patch Available, Sahil Takiar)
          16. inheritPerms should be conditional based on the target filesystem (Sub-task, Resolved, Sahil Takiar)
          17. druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x (Sub-task, Closed, Steve Loughran)
          18. Dynamic Partitioning Integration with Hadoop's S3A OutputCommitter (Sub-task, Open, Unassigned)
          19. Ability to selectively run tests in TestBlobstoreCliDriver (Sub-task, Open, Sahil Takiar)
          20. PerfLogger integration for critical Hive-on-S3 paths (Sub-task, Resolved, Sahil Takiar)
          21. Add --SORT_QUERY_RESULTS to hive-blobstore/map_join.q.out (Sub-task, Resolved, Sahil Takiar)
          22. Add Parquet specific tests to BlobstoreCliDriver (Sub-task, Resolved, Sahil Takiar)

              People

              • Assignee: Sergio Peña (spena)
              • Reporter: Sergio Peña (spena)
              • Votes: 0
              • Watchers: 35
