Details
- Type: Sub-task
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
Description
Hive does not integrate with Hadoop's OutputCommitter. Instead, it uses a NullOutputCommitter and implements its own commit logic, which is spread across FileSinkOperator, MoveTask, and Hive.
The Hadoop community is building an OutputCommitter that integrates with S3Guard and performs a safe, coordinated commit of data to S3 inside individual tasks (HADOOP-13786). Integrating Hive with this new OutputCommitter would benefit Hive-on-S3 in the following ways:
- Data is written only once; committing data directly at the task level means no renames are necessary
- The commit is done safely, in a coordinated manner; duplicate tasks (from task retries or speculative execution) should not interfere with each other
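The coordination property in the second point can be illustrated with a minimal, in-memory sketch. It is not the S3A committer, not Hadoop's OutputCommitter API, and not the patch attached to this issue: the class TaskCommitSketch and its methods are hypothetical names chosen for illustration. It only models the idea that each task attempt stages its output privately and that at most one attempt per task is allowed to commit. It does not model how the S3A committers avoid renames.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory model of a task-level commit protocol.
// None of these names come from Hive or Hadoop.
public class TaskCommitSketch {
    // Output staged by each task attempt, keyed by "task/attempt"; not yet visible.
    private final Map<String, String> pending = new HashMap<>();
    // The single winning output per task, keyed by task ID (sorted for stable order).
    private final Map<String, String> committed = new TreeMap<>();

    // A task attempt stages its output; nothing is visible to readers yet.
    public synchronized void write(String taskId, int attempt, String data) {
        pending.put(taskId + "/" + attempt, data);
    }

    // Commit one attempt. The first attempt to commit a task wins;
    // later attempts of the same task are rejected and their staged data is dropped.
    public synchronized boolean commitTask(String taskId, int attempt) {
        String data = pending.remove(taskId + "/" + attempt);
        if (data == null || committed.containsKey(taskId)) {
            return false;
        }
        committed.put(taskId, data);
        return true;
    }

    // Job commit: discard any leftover staged output and return the winners.
    public synchronized List<String> commitJob() {
        pending.clear();
        return new ArrayList<>(committed.values());
    }

    public static void main(String[] args) {
        TaskCommitSketch c = new TaskCommitSketch();
        c.write("task_0", 0, "rows-a");
        c.write("task_0", 1, "rows-a"); // speculative duplicate of task_0
        c.write("task_1", 0, "rows-b");

        System.out.println(c.commitTask("task_0", 1)); // duplicate attempt commits first
        System.out.println(c.commitTask("task_0", 0)); // original attempt is rejected
        System.out.println(c.commitTask("task_1", 0));
        System.out.println(c.commitJob());
    }
}
```

In this sketch the rejected attempt's output never becomes visible, so a retry or speculative duplicate cannot overwrite or duplicate the committed result. A real committer would have to provide the same guarantee against S3 rather than an in-memory map.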
Attachments
Issue Links
- is blocked by
  - HIVE-19217 Upgrade to Hadoop 3.1.0 (Closed)
  - HIVE-18319 Upgrade to Hadoop 3.0.0 (Closed)
- is related to
  - HADOOP-15421 Stabilise/formalise the JSON _SUCCESS format used in the S3A committers (Resolved)
  - HADOOP-13786 Add S3A committers for zero-rename commits to S3 endpoints (Resolved)
- relates to
  - HIVE-19321 Dynamic Partitioning Integration with Hadoop's S3A OutputCommitter (Open)
- links to