Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
https://github.com/apache/hive/blob/cdd55aa319a3440963a886ebfff11cd2a240781d/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1952-L2010
compareTempOrDuplicateFiles uses a combination of attemptId and file size to determine which file(s) to keep.
I've seen instances where this function throws an exception because the file from the newer attemptId is smaller than the file from the older attemptId, which fails the query.
I think this assumption is faulty: a newer attempt can legitimately produce a smaller file due to factors such as file compression and the order in which values are written. It may be prudent to simply trust that the newest attemptId is the best choice.
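The fix suggested above can be sketched as follows. This is a hypothetical standalone illustration, not the actual Hive code: given duplicate task-attempt output files named `<taskId>_<attemptId>` (the naming scheme Hive uses for task outputs), it keeps the file from the highest attemptId without comparing file sizes, so a smaller-but-newer file no longer triggers a failure.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PickNewestAttempt {

    // Parse the attempt id from a name of the form "<taskId>_<attemptId>",
    // e.g. "000000_2" -> 2. (Hypothetical helper for this sketch.)
    static int attemptId(String fileName) {
        int idx = fileName.lastIndexOf('_');
        return Integer.parseInt(fileName.substring(idx + 1));
    }

    // Among duplicate outputs for the same task, keep the newest attempt.
    // File size is deliberately ignored: compression and write ordering can
    // make a newer, correct attempt smaller than an older one.
    static String pickNewest(List<String> duplicates) {
        return duplicates.stream()
                .max(Comparator.comparingInt(PickNewestAttempt::attemptId))
                .orElseThrow(() -> new IllegalArgumentException("no files"));
    }

    public static void main(String[] args) {
        List<String> dups = Arrays.asList("000000_0", "000000_2", "000000_1");
        System.out.println(pickNewest(dups)); // prints "000000_2"
    }
}
```

With the size comparison removed, the selection is a pure max over attempt ids, so the "newer attempt is smaller" case can no longer throw.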
Attachments
Issue Links
- causes HIVE-23614 Always pass HiveConfig to removeTempOrDuplicateFiles (Closed)