Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Hello,
This can happen in the following scenario:
- the execution engine is Tez;
- speculative execution is on;
- the query inserts into a table and its last step is a UNION SQL clause.
The problem is that Tez creates an extra layer of subdirectories when there is a UNION. Later, when deduplicating, Hive doesn't take that extra level into account: it only deduplicates the folders, not the files inside them.
So for a query like this:
insert overwrite table union_all select * from union_first_part union all select * from union_second_part;
The folder structure afterwards will be like this (a possible example):
.../union_all/HIVE_UNION_SUBDIR_1/000000_0
.../union_all/HIVE_UNION_SUBDIR_1/000000_1
.../union_all/HIVE_UNION_SUBDIR_2/000000_1
The attached patch increases the number of folder levels that Hive will check recursively for duplicates when we have a UNION in Tez.
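To illustrate the idea (this is not the code from the patch), here is a minimal, self-contained sketch of deduplicating at the file level inside each union subdirectory. The class name UnionDedupSketch and the methods dedupe and attempt are hypothetical. The sketch assumes file names of the form <taskId>_<attempt>, as in the example above, and it keeps the highest attempt number per task purely for illustration; the rule Hive actually uses to pick the surviving file may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hive's implementation.
class UnionDedupSketch {

    // Returns one path per (subdirectory, task id) pair.
    // Assumption: the text after the last '_' is the attempt number.
    static List<String> dedupe(List<String> paths) {
        // Key = parent directory + task id, so files in different
        // HIVE_UNION_SUBDIR_* folders are never treated as duplicates
        // of each other, while attempts inside one folder are.
        Map<String, String> keptByTask = new LinkedHashMap<>();
        for (String path : paths) {
            String key = path.substring(0, path.lastIndexOf('_'));
            String current = keptByTask.get(key);
            // Illustrative rule only: keep the highest attempt number.
            if (current == null || attempt(path) > attempt(current)) {
                keptByTask.put(key, path);
            }
        }
        return new ArrayList<>(keptByTask.values());
    }

    // Parses the attempt number from a name such as "000000_1".
    static int attempt(String path) {
        return Integer.parseInt(path.substring(path.lastIndexOf('_') + 1));
    }

    public static void main(String[] args) {
        List<String> files = List.of(
                "union_all/HIVE_UNION_SUBDIR_1/000000_0",
                "union_all/HIVE_UNION_SUBDIR_1/000000_1",
                "union_all/HIVE_UNION_SUBDIR_2/000000_1");
        // Prints the paths that survive deduplication, one per line.
        dedupe(files).forEach(System.out::println);
    }
}
```

With the folder structure from the example, the two attempts of task 000000 under HIVE_UNION_SUBDIR_1 collapse to a single file, while the file under HIVE_UNION_SUBDIR_2 is left alone because it belongs to a different union branch.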
Feel free to reach out if you have any questions.
Attachments
Issue Links
- is related to: HIVE-27494 Deduplicate the task result that generated by more branches in union all (Closed)