
Type: Bug

Status: Resolved

Priority: Major

Resolution: Fixed

Affects Version/s: None

Fix Version/s: 4.0.0

Component/s: Transactions

Labels:
The query based MINOR compaction uses the following sorting order in its inner query: `bucket`, `originalTransaction`, `rowId`, as it can be seen in the code.
But actually the rows should be ordered by originalTransactionId, bucketProperty and rowId, otherwise the delete deltas cannot be applied correctly. And this is the order what the MR MAJOR and MR MINOR compactions write.
The sorting order used by the querybased MINOR compaction can lead to duplicated rows when running the compaction after multiple merge statements. This issue can be reproduced for example by running the following queries:
CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES ('transactional'='true'); INSERT INTO transactions VALUES (1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, 'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08'); CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC; INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, 'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, 'value_11'),(12, 'value_12'); MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET value = S.value WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC; INSERT INTO merge_source_2 VALUES (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, 'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14'); MERGE INTO transactions AS T USING merge_source_2 AS S ON T.ID = S.ID WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET value = S.value WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); ALTER TABLE transactions COMPACT 'MINOR'; CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC; INSERT INTO merge_source_3 VALUES (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, 'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, 'value_15'); MERGE INTO transactions AS T USING merge_source_3 AS S ON T.ID = S.ID WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET value = S.value WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); ALTER TABLE transactions COMPACT 'MINOR';
Running a select after the second compaction finished will return duplicated rows:
select * from transactions order by id; +++  transactions.id  transactions.value  +++  1  newvalue_1   1  latestvalue_1   2  newestvalue_2   2  newvalue_2   3  value_03   4  latestvalue_4   4  newvalue_4   5  latestvalue_5   6  newvalue_6   6  newestvalue_6   7  value_07   8  value_08   9  latestvalue_9   10  newestvalue_10   11  latestvalue_11   12  value_12   13  latestvalue_13   14  value_14   15  value_15  +++
If the same queries are run with MR MINOR compaction, instead of the querybased MINOR compaction, the select will return the correct result:
+++  transactions.id  transactions.value  +++  1  latestvalue_1   2  newestvalue_2   3  value_03   4  latestvalue_4   5  latestvalue_5   6  newestvalue_6   7  value_07   8  value_08   9  latestvalue_9   10  newestvalue_10   11  latestvalue_11   12  value_12   13  latestvalue_13   14  value_14   15  value_15  +++
The content of the bucket files in the delta and delete delta directories after the querybased and MR compactions look like this.
Querybased
Processing data file tmp/transactions/delta_0000001_0000004_v0000429/bucket_00000 [length: 947] {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":1,"row":{"id":1,"value":"value_01"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":1,"row":{"id":2,"value":"value_02"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":2,"currenttransaction":1,"row":{"id":3,"value":"value_03"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":1,"row":{"id":4,"value":"value_04"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":1,"row":{"id":5,"value":"value_05"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":1,"row":{"id":6,"value":"value_06"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":6,"currenttransaction":1,"row":{"id":7,"value":"value_07"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":7,"currenttransaction":1,"row":{"id":8,"value":"value_08"}} {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":{"id":9,"value":"value_9"}} {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":{"id":10,"value":"value_10"}} {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":2,"row":{"id":11,"value":"value_11"}} {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":{"id":12,"value":"value_12"}} {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":3,"row":{"id":13,"value":"value_13"}} {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":{"id":14,"value":"value_14"}} {"operation":0,"originaltransaction":4,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":{"id":15,"value":"value_15"}} {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":2,"row":{"id":1,"value":"newvalue_1"}} {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":2,"row":{"id":2,"value":"newvalue_2"}} {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":2,"row":{"id":4,"value":"newvalue_4"}} {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":2,"row":{"id":6,"value":"newvalue_6"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":0,"currenttransaction":3,"row":{"id":10,"value":"newestvalue_10"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":{"id":11,"value":"newestvalue_11"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":{"id":2,"value":"newestvalue_2"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":{"id":4,"value":"newestvalue_4"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":4,"currenttransaction":3,"row":{"id":6,"value":"newestvalue_6"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":{"id":5,"value":"latestvalue_5"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":{"id":9,"value":"latestvalue_9"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":2,"currenttransaction":4,"row":{"id":1,"value":"latestvalue_1"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":{"id":13,"value":"latestvalue_13"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":4,"currenttransaction":4,"row":{"id":11,"value":"latestvalue_11"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":5,"currenttransaction":4,"row":{"id":4,"value":"latestvalue_4"}} Processing data file tmp/transactions/delete_delta_0000001_0000004_v0000429/bucket_00000 [length: 713] {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":null} {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":null} {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":null} {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":2,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":null} _____________________________________________________________________________________________________________________
MR
Processing data file tmp/transactions/delta_0000001_0000004_v0000479/bucket_00000 [length: 1002] {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,"row":{"id":1,"value":"value_01"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":1,"row":{"id":2,"value":"value_02"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":2,"currentTransaction":1,"row":{"id":3,"value":"value_03"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":1,"row":{"id":4,"value":"value_04"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":1,"row":{"id":5,"value":"value_05"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":1,"row":{"id":6,"value":"value_06"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":6,"currentTransaction":1,"row":{"id":7,"value":"value_07"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":7,"currentTransaction":1,"row":{"id":8,"value":"value_08"}} {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":{"id":9,"value":"value_9"}} {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":{"id":10,"value":"value_10"}} {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":2,"row":{"id":11,"value":"value_11"}} {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":{"id":12,"value":"value_12"}} {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":2,"row":{"id":1,"value":"newvalue_1"}} {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":2,"row":{"id":2,"value":"newvalue_2"}} {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":2,"row":{"id":4,"value":"newvalue_4"}} {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":2,"row":{"id":6,"value":"newvalue_6"}} {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":13,"value":"value_13"}} {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":{"id":14,"value":"value_14"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":0,"currentTransaction":3,"row":{"id":10,"value":"newestvalue_10"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":{"id":11,"value":"newestvalue_11"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":{"id":2,"value":"newestvalue_2"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":{"id":4,"value":"newestvalue_4"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":4,"currentTransaction":3,"row":{"id":6,"value":"newestvalue_6"}} {"operation":0,"originalTransaction":4,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":{"id":15,"value":"value_15"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":{"id":5,"value":"latestvalue_5"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":{"id":9,"value":"latestvalue_9"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":2,"currentTransaction":4,"row":{"id":1,"value":"latestvalue_1"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":{"id":13,"value":"latestvalue_13"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":4,"currentTransaction":4,"row":{"id":11,"value":"latestvalue_11"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":5,"currentTransaction":4,"row":{"id":4,"value":"latestvalue_4"}} _____________________________________________________________________________________________________________________ Processing data file tmp/transactions/delete_delta_0000001_0000004_v0000479/bucket_00000 [length: 632] {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":null} {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":null} {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":null} {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":2,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":null} _____________________________________________________________________________________________________________________
It can be seen that when the querybased MINOR compaction was used, the sorting order of the rows is "bucket, originalTransactionId, rowId". But when the MR MINOR compaction was used, the order is "originalTransactionId, bucket, rowId".
The ordering in the querybased compaction has to be fixed to be aligned with the MR compaction's ordering.
 links to