Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 2.2.0
- None
- None
Description
For bucketed tables, FileSinkOperator is expected (in some cases) to produce a specific number of files even if they are empty.
FileSinkOperator.closeOp(boolean abort) has logic to create files even if empty.
This doesn't work properly for the Acid path. For Insert, the OrcRecordUpdater(s) are set up in createBucketForFileIdx(), which creates the actual bucketN file (as of HIVE-14007, it does so regardless of whether the RecordUpdater sees any rows). This causes empty (i.e. ORC metadata only) bucket files to be created for multiFileSpray=true if a particular FileSinkOperator.process() sees at least 1 row. For example,
create table fourbuckets (a int, b int) clustered by (a) into 4 buckets stored as orc TBLPROPERTIES ('transactional'='true');
insert into fourbuckets values(0,1),(1,1);
with mapreduce.job.reduces = 1 or 2
For the Update/Delete path, OrcRecordWriter is created lazily when the 1st row that needs to land there is seen. Thus it never creates empty buckets, no matter what the value of skipFiles in closeOp(boolean) is.
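The difference between the two paths can be illustrated with a small standalone model. This is only a sketch, not Hive code: the class and method names are hypothetical, and it assumes that an int key k lands in bucket k mod numBuckets and that bucket b is handled by reducer b mod numReducers. Under those assumptions it reproduces the fourbuckets example: the eager (Insert-like) policy opens all 4 bucket files with 1 or 2 reducers, while the lazy (Update/Delete-like) policy opens only the buckets that actually receive rows.

```java
import java.util.Set;
import java.util.TreeSet;

class BucketFileModel {
    // Assumption: an int key maps to bucket (key mod numBuckets).
    static int bucketOf(int key, int numBuckets) {
        return (key & Integer.MAX_VALUE) % numBuckets;
    }

    // Insert-like policy: once a reducer sees at least one row, it opens
    // every bucket file it owns, whether or not rows arrive for that bucket.
    // Assumption: bucket b is owned by reducer (b mod numReducers).
    static Set<Integer> eagerFiles(int[] keys, int numBuckets, int numReducers) {
        Set<Integer> created = new TreeSet<>();
        for (int key : keys) {
            int reducer = bucketOf(key, numBuckets) % numReducers;
            for (int b = 0; b < numBuckets; b++) {
                if (b % numReducers == reducer) {
                    created.add(b);
                }
            }
        }
        return created;
    }

    // Update/Delete-like policy: a bucket file is opened only when the
    // first row for that bucket is seen.
    static Set<Integer> lazyFiles(int[] keys, int numBuckets) {
        Set<Integer> created = new TreeSet<>();
        for (int key : keys) {
            created.add(bucketOf(key, numBuckets));
        }
        return created;
    }

    public static void main(String[] args) {
        int[] keys = {0, 1}; // column a of the two inserted rows
        System.out.println(eagerFiles(keys, 4, 1)); // [0, 1, 2, 3]
        System.out.println(eagerFiles(keys, 4, 2)); // [0, 1, 2, 3]
        System.out.println(lazyFiles(keys, 4));     // [0, 1]
    }
}
```

In this model buckets 2 and 3 are the empty files: they are opened on the eager path only because their owning reducer saw a row for a different bucket.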
Once Split Update does the split early (in the operator pipeline), only the Insert path will matter, since base and delta are the only files that split computation, etc. looks at. delete_delta is only for Acid internals, so there is never any reason to create empty files there.
Also make sure to close RecordUpdaters in FileSinkOperator.abortWriters()
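A minimal sketch of what closing the updaters on abort could look like. It is not the actual FileSinkOperator code: the Updater interface below is a hypothetical stand-in that models only a close call taking an abort flag, and abortAll is an illustrative name. The sketch skips slots that were never opened (the lazy case) and keeps going if one close fails, so a single bad writer does not leave the others open.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class AbortWritersSketch {
    // Hypothetical stand-in for a record updater; only close-with-abort is modeled.
    interface Updater {
        void close(boolean abort) throws IOException;
    }

    // Closes every opened updater with abort=true and clears its slot.
    // Returns how many updaters were closed without error.
    static int abortAll(Updater[] updaters) {
        int closed = 0;
        for (int i = 0; i < updaters.length; i++) {
            Updater u = updaters[i];
            if (u == null) {
                continue; // slot was never opened (lazy creation)
            }
            try {
                u.close(true);
                closed++;
            } catch (IOException e) {
                // Already aborting: report and continue with the remaining updaters.
                System.err.println("Ignoring failure while aborting updater " + i + ": " + e.getMessage());
            } finally {
                updaters[i] = null; // never close the same updater twice
            }
        }
        return closed;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Updater ok = abort -> log.add("closed, abort=" + abort);
        Updater failing = abort -> { throw new IOException("simulated failure"); };
        Updater[] slots = {ok, null, failing, ok};
        System.out.println(abortAll(slots)); // 2
        System.out.println(log);             // [closed, abort=true, closed, abort=true]
    }
}
```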
Issue Links
- is blocked by: HIVE-16077 UPDATE/DELETE fails with numBuckets > numReducers (Closed)
- relates to: HIVE-13403 Make Streaming API not create empty buckets (Resolved)