[HIVE-14272] ConditionalResolverMergeFiles should keep staging data on HDFS, then copy (no rename) to S3 - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

If hive.merge.mapfiles is true and the output table is on S3, Hive generates a conditional plan in which small output files are merged into larger ones.

If the output files written by the initial MR job are small, then a second MR job is run to merge the output into larger files (a copy from S3 to S3 in the current code).

If the original output files are already large enough, the conditional task is instead followed by a move/rename, which is very expensive on S3: S3 has no native rename, so the data is copied and the original is then deleted.
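The two branches of the conditional plan described above can be sketched roughly as follows. This is only an illustration, not Hive's actual resolver code: the function name `resolve_task` is hypothetical, the threshold is assumed to correspond to hive.merge.smallfiles.avgsize (assumed default 16000000 bytes; check your Hive version), and the real resolver handles further cases such as dynamic partitions.

```python
from statistics import mean

# Assumption: mirrors the default of hive.merge.smallfiles.avgsize, in bytes.
AVG_SIZE_THRESHOLD = 16_000_000


def resolve_task(file_sizes, threshold=AVG_SIZE_THRESHOLD):
    """Pick the follow-up task for the output of the initial MR job.

    Hypothetical helper: returns "merge" when the average output file is
    smaller than the threshold, otherwise "move" (the move/rename path).
    """
    if not file_sizes:
        # Nothing to merge, so fall through to the plain move.
        return "move"
    return "merge" if mean(file_sizes) < threshold else "move"
```

For example, ten 1 MB files would take the merge branch, while two files of 64 MB and 128 MB would take the move/rename branch.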

We should keep the staging data on HDFS and only then copy it (no rename) to S3 as the final files.
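A back-of-the-envelope model of why the proposal would help on the move/rename path is sketched below. It is an assumption-laden illustration, not a measurement: the function name `s3_bytes_written` and the `staging` parameter are hypothetical, and the model only counts bytes that land on S3, treating an S3 rename as a full copy of the data.

```python
def s3_bytes_written(output_bytes, staging):
    """Rough count of bytes written to S3 for output that needs no merge.

    Hypothetical model:
      staging="s3":   the job writes its output to an S3 staging location,
                      then the rename copies every byte again within S3.
      staging="hdfs": the job writes to HDFS, then the data is copied
                      to S3 exactly once.
    """
    if staging == "s3":
        return 2 * output_bytes
    if staging == "hdfs":
        return output_bytes
    raise ValueError(f"unknown staging location: {staging!r}")
```

Under this model, staging on HDFS halves the volume written to S3, at the cost of temporary HDFS space for the staged output.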

People

Assignee: Sergio Peña

Reporter: Sergio Peña

Votes: 0

Watchers: 4

Dates

Created: 18/Jul/16 21:59

Updated: 01/Nov/16 19:38

Resolved: 01/Nov/16 19:38