- Type: Sub-task
- Status: Patch Available
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Component/s: None
- Labels: None
- Target Version/s:
Hive does not integrate with Hadoop's OutputCommitter. It uses a NullOutputCommitter and implements its own commit logic, which is spread across FileSinkOperator, MoveTask, and Hive.
The Hadoop community is building an OutputCommitter that integrates with S3Guard and performs a safe, coordinated commit of data on S3 inside individual tasks (HADOOP-13786). If Hive can integrate with this new OutputCommitter, Hive-on-S3 would gain several benefits:
- Data is only written once; directly committing data at a task level means no renames are necessary
- The commit is done safely, in a coordinated manner; duplicate tasks (from task retries or speculative execution) should not step on each other
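To make the two points above concrete, here is a minimal sketch of task-level commit in plain Java. It is a toy model only, not the S3A committer from HADOOP-13786 and not Hive code: the class `TaskCommitSketch` and its methods `write`, `commitTask`, `abortTask`, and `visibleOutput` are hypothetical names chosen for illustration, and in-memory maps stand in for S3. The idea it shows is that each attempt writes its data once into attempt-private pending state, and that only one attempt per task is allowed to publish.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of task-level commit. All names are hypothetical, not Hadoop or Hive APIs. */
class TaskCommitSketch {
    // Output that attempts have written but not yet made visible, keyed by attempt ID.
    private final Map<String, String> pending = new HashMap<>();
    // Final, visible output keyed by task ID; written at most once per task.
    private final Map<String, String> committed = new TreeMap<>();

    /** An attempt writes its data once, into attempt-private pending state. */
    synchronized void write(String taskId, int attempt, String data) {
        pending.put(taskId + "_" + attempt, data);
    }

    /** Commits an attempt; returns false if there is nothing to commit or the task is already committed. */
    synchronized boolean commitTask(String taskId, int attempt) {
        String data = pending.remove(taskId + "_" + attempt);
        if (data == null || committed.containsKey(taskId)) {
            return false; // a retry or speculative duplicate cannot overwrite the winner
        }
        committed.put(taskId, data); // publish directly: no rename or copy step
        return true;
    }

    /** Discards an attempt's pending output, for example after a task failure. */
    synchronized void abortTask(String taskId, int attempt) {
        pending.remove(taskId + "_" + attempt);
    }

    /** Returns the output that readers of the destination would see. */
    synchronized Map<String, String> visibleOutput() {
        return new TreeMap<>(committed);
    }

    public static void main(String[] args) {
        TaskCommitSketch job = new TaskCommitSketch();
        job.write("task_0", 0, "rows from attempt 0");
        job.write("task_0", 1, "rows from attempt 1"); // speculative duplicate
        System.out.println(job.commitTask("task_0", 0)); // prints true
        System.out.println(job.commitTask("task_0", 1)); // prints false
        System.out.println(job.visibleOutput());         // prints {task_0=rows from attempt 0}
    }
}
```

In this sketch the coordination is a single in-process lock. A real committer has to coordinate across processes and against S3 itself, which is the part HADOOP-13786 is meant to provide.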
- is blocked by
  - HIVE-19217 Upgrade to Hadoop 3.1.0 (Resolved)
  - HIVE-18319 Upgrade to Hadoop 3.0.0 (Resolved)
- is related to
  - HADOOP-15421 Stabilise/formalise the JSON _SUCCESS format used in the S3A committers (Resolved)
  - HADOOP-13786 Add S3A committers for zero-rename commits to S3 endpoints (Resolved)
- relates to
  - HIVE-19321 Dynamic Partitioning Integration with Hadoop's S3A OutputCommitter (Open)