[HIVE-22098] Data loss occurs when multiple tables are joined with different bucket_version - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Blocker
Resolution: Unresolved
Affects Version/s: 3.1.0, 3.1.2
Fix Version/s: None
Component/s: Operators
Labels:
- data-loss
- wrongresults

Target Version/s: 3.1.0

Description

When tables with different bucketVersion values are joined and the number of reducers is greater than 2, the result is incorrect (data loss).
Scenario 1: A three-table join. The intermediate result of joining table_a (the first table) with table_b (the second table) is referred to as tmp_a_b. When tmp_a_b is joined with the third table, the third table has bucket_version=2, which is the default for tables created after hive-3.0.0, whereas the intermediate data tmp_a_b is initialized with bucketVersion=-1, so its ReduceSinkOperator joins with bucketVersion=-1. In its init method, ReduceSinkOperator selects the hash algorithm for the join column according to bucketVersion: if bucketVersion = 2 and the operation is not an ACID operation, it uses the new hash algorithm; otherwise it uses the old hash algorithm. Because the two sides of the join use different hash algorithms, their rows are assigned to different partitions. At the reducer stage, rows with the same key cannot be paired, which results in data loss.
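
The effect described in Scenario 1 can be illustrated with a minimal sketch outside Hive. Everything in it is an assumption made for illustration: old_hash and new_hash are stand-ins (a Java-style 31-multiplier string hash and CRC32), not Hive's actual hash implementations, and shuffle and reduce_side_join are hypothetical helpers that only mimic routing rows to reducers and pairing them within each reducer.

```python
import zlib

NUM_REDUCERS = 3  # the report says the problem appears with more than 2 reducers


def old_hash(key: str) -> int:
    """Stand-in for the 'old' algorithm: a Java-style 31-multiplier string hash."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def new_hash(key: str) -> int:
    """Stand-in for the 'new' algorithm: CRC32, chosen only because it differs."""
    return zlib.crc32(key.encode("utf-8"))


def shuffle(rows, hash_fn):
    """Assign each (key, value) row to a reducer using the given hash."""
    reducers = [[] for _ in range(NUM_REDUCERS)]
    for key, value in rows:
        reducers[hash_fn(key) % NUM_REDUCERS].append((key, value))
    return reducers


def reduce_side_join(left_parts, right_parts):
    """Pair rows with equal keys, but only within the same reducer."""
    out = []
    for left, right in zip(left_parts, right_parts):
        for lk, lv in left:
            for rk, rv in right:
                if lk == rk:
                    out.append((lk, lv, rv))
    return sorted(out)


keys = [f"k{i}" for i in range(20)]
tmp_a_b = [(k, "left") for k in keys]   # intermediate join result
table_c = [(k, "right") for k in keys]  # third table

# Both sides hashed the same way: every key meets its partner.
consistent = reduce_side_join(shuffle(tmp_a_b, old_hash), shuffle(table_c, old_hash))
# Sides hashed differently: matching keys can land on different reducers.
mixed = reduce_side_join(shuffle(tmp_a_b, old_hash), shuffle(table_c, new_hash))

print(len(consistent), "joined rows when both sides use the same hash")
print(len(mixed), "joined rows when the two sides use different hashes")
```

With a single hash on both sides, all 20 keys are joined. With mismatched hashes, only the keys that happen to land on the same reducer on both sides survive, which corresponds to the partial data loss reported here.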

Scenario 2: Create two test tables:
create table table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES ('bucketing_version'='1');
create table table_bucketversion_2(col_1 string, col_2 string) TBLPROPERTIES ('bucketing_version'='2');
When table_bucketversion_1 is joined with table_bucketversion_2, part of the result data is lost because the bucketVersion of the two tables is different.

Attachments

- table_c_data.orc (12/Aug/19 08:21, 0.4 kB, GuangMing Lu)
- table_b_data.orc (12/Aug/19 08:21, 0.4 kB, GuangMing Lu)
- table_a_data.orc (12/Aug/19 08:21, 0.4 kB, GuangMing Lu)
- image-2019-08-12-18-45-15-771.png (12/Aug/19 10:45, 8 kB, GuangMing Lu)
- HIVE-22098.1.patch (12/Aug/19 10:58, 3 kB, GuangMing Lu)
- join_test.sql (22/Sep/20 01:29, 0.6 kB, GuangMing Lu)

Issue Links

is fixed by
- HIVE-21304 Make bucketing version usage more robust (Closed)

is related to
- HIVE-18735 Create table like loses transactional attribute (Closed)
- HIVE-23809 Data loss occurs when using tez engine to join different bucketing_version tables (Patch Available)

relates to
- HIVE-18983 Add support for table properties inheritance in Create table like (Patch Available)

People

Assignee: Unassigned
Reporter: GuangMing Lu
Votes: 3
Watchers: 18

Dates

Created: 12/Aug/19 08:17
Updated: 26/Sep/23 07:48