[HIVE-14269] Performance optimizations for data on S3 - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: None
Labels: None

Description

Working with tables that reside on Amazon S3 (or any other object store) incurs several performance penalties when reading or writing data, and also raises consistency issues.

This JIRA is an umbrella task to monitor all the performance improvements that can be done in Hive to work better with S3 data.

Issue Links

depends upon

HADOOP-11694 Über-jira: S3a phase II: robustness, scale and performance

Resolved

HIVE-14323 Reduce number of FS permissions and redundant FS operations

Closed

is depended upon by

HADOOP-13525 Optimize uses of FS operations in the ASF analysis frameworks and libraries

Resolved

is related to

HIVE-16277 Exchange Partition between filesystems throws "IllegalArgumentException Wrong FS"

Open

HIVE-1620 Patch to write directly to S3 from Hive

Open

relates to

HADOOP-13204 Über-jira: S3a phase III: scale and tuning

Resolved

HIVE-14920 S3: Optimize SimpleFetchOptimizer::checkThreshold()

Closed

Sub-Tasks

1.

Write temporary data to HDFS when doing inserts on tables located on S3

Resolved

Sergio Peña

2.

Skip 'distcp' call when copying data from HDFS to S3

Patch Available

Sergio Peña

3.

FileSinkOperator should not rename files to final paths when S3 is the default destination

Reopened

Sergio Peña

4.

ConditionalResolverMergeFiles should keep staging data on HDFS, then copy (no rename) to S3

Resolved

Sergio Peña

5.

Remove Hive file listing during split computation

Closed

Peter Varga

100%

6.

Change .hive-staging directory created on S3 to HDFS when using INSERT OVERWRITE

Resolved

Unassigned

7.

Add integration tests for hive on S3

Resolved

Thomas Poepping

8.

S3-to-S3 Renames: Files should be moved individually rather than at a directory level

Resolved

Sahil Takiar

9.

Remove extra MoveTask operators from the ConditionalTask

Resolved

Sergio Peña

10.

Last MR job in Hive should be able to write to a different scratch directory

Reopened

Sahil Takiar

11.

Investigate if staging data on S3 can always go under the scratch dir

Open

Unassigned

12.

Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Resolved

Sahil Takiar

13.

Blobstores should use fs.listFiles(path, recursive=true) rather than FileUtils.listStatusRecursively

Resolved

Unassigned

14.

Fix NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V

Closed

Jesús Camacho Rodríguez

15.

Add support for using Hadoop's S3A OutputCommitter

Patch Available

Unassigned

16.

inheritPerms should be conditional based on the target filesystem

Resolved

Sahil Takiar

17.

druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

Closed

Steve Loughran

18.

Dynamic Partitioning Integration with Hadoop's S3A OutputCommitter

Open

Unassigned

19.

Ability to selectively run tests in TestBlobstoreCliDriver

Open

Sahil Takiar

20.

PerfLogger integration for critical Hive-on-S3 paths

Closed

Sahil Takiar

21.

Add --SORT_QUERY_RESULTS to hive-blobstore/map_join.q.out

Closed

Sahil Takiar

22.

Add Parquet specific tests to BlobstoreCliDriver

Closed

Sahil Takiar

People

Assignee: Sergio Peña

Reporter: Sergio Peña

Votes: 0

Watchers: 35

Dates

Created: 18/Jul/16 19:23

Updated: 15/Feb/19 21:36

Time Tracking

Estimated: Not Specified

Remaining:

Logged:

40m