Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
Description
The query based MINOR compaction uses the following sorting order in its inner query: `bucket`, `originalTransaction`, `rowId`, as it can be seen in the code.
But actually the rows should be ordered by originalTransactionId, bucketProperty and rowId, otherwise the delete deltas cannot be applied correctly. And this is the order what the MR MAJOR and MR MINOR compactions write.
The sorting order used by the query-based MINOR compaction can lead to duplicated rows when running the compaction after multiple merge statements. This issue can be reproduced for example by running the following queries:
CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES ('transactional'='true'); INSERT INTO transactions VALUES (1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, 'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08'); CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC; INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, 'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, 'value_11'),(12, 'value_12'); MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET value = S.value WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC; INSERT INTO merge_source_2 VALUES (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, 'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14'); MERGE INTO transactions AS T USING merge_source_2 AS S ON T.ID = S.ID WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET value = S.value WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); ALTER TABLE transactions COMPACT 'MINOR'; CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC; INSERT INTO merge_source_3 VALUES (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, 'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, 'value_15'); MERGE INTO transactions AS T USING merge_source_3 AS S ON T.ID = S.ID WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET value = S.value WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value); ALTER TABLE transactions COMPACT 'MINOR';
Running a select after the second compaction finished will return duplicated rows:
select * from transactions order by id; +------------------+---------------------+ | transactions.id | transactions.value | +------------------+---------------------+ | 1 | newvalue_1 | | 1 | latestvalue_1 | | 2 | newestvalue_2 | | 2 | newvalue_2 | | 3 | value_03 | | 4 | latestvalue_4 | | 4 | newvalue_4 | | 5 | latestvalue_5 | | 6 | newvalue_6 | | 6 | newestvalue_6 | | 7 | value_07 | | 8 | value_08 | | 9 | latestvalue_9 | | 10 | newestvalue_10 | | 11 | latestvalue_11 | | 12 | value_12 | | 13 | latestvalue_13 | | 14 | value_14 | | 15 | value_15 | +------------------+---------------------+
If the same queries are run with MR MINOR compaction, instead of the query-based MINOR compaction, the select will return the correct result:
+------------------+---------------------+ | transactions.id | transactions.value | +------------------+---------------------+ | 1 | latestvalue_1 | | 2 | newestvalue_2 | | 3 | value_03 | | 4 | latestvalue_4 | | 5 | latestvalue_5 | | 6 | newestvalue_6 | | 7 | value_07 | | 8 | value_08 | | 9 | latestvalue_9 | | 10 | newestvalue_10 | | 11 | latestvalue_11 | | 12 | value_12 | | 13 | latestvalue_13 | | 14 | value_14 | | 15 | value_15 | +------------------+---------------------+
The content of the bucket files in the delta and delete delta directories after the query-based and MR compactions look like this.
Query-based
Processing data file tmp/transactions/delta_0000001_0000004_v0000429/bucket_00000 [length: 947] {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":1,"row":{"id":1,"value":"value_01"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":1,"row":{"id":2,"value":"value_02"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":2,"currenttransaction":1,"row":{"id":3,"value":"value_03"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":1,"row":{"id":4,"value":"value_04"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":1,"row":{"id":5,"value":"value_05"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":1,"row":{"id":6,"value":"value_06"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":6,"currenttransaction":1,"row":{"id":7,"value":"value_07"}} {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":7,"currenttransaction":1,"row":{"id":8,"value":"value_08"}} {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":{"id":9,"value":"value_9"}} {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":{"id":10,"value":"value_10"}} {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":2,"row":{"id":11,"value":"value_11"}} {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":{"id":12,"value":"value_12"}} {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":3,"row":{"id":13,"value":"value_13"}} {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":{"id":14,"value":"value_14"}} {"operation":0,"originaltransaction":4,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":{"id":15,"value":"value_15"}} {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":2,"row":{"id":1,"value":"newvalue_1"}} {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":2,"row":{"id":2,"value":"newvalue_2"}} {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":2,"row":{"id":4,"value":"newvalue_4"}} {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":2,"row":{"id":6,"value":"newvalue_6"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":0,"currenttransaction":3,"row":{"id":10,"value":"newestvalue_10"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":{"id":11,"value":"newestvalue_11"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":{"id":2,"value":"newestvalue_2"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":{"id":4,"value":"newestvalue_4"}} {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":4,"currenttransaction":3,"row":{"id":6,"value":"newestvalue_6"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":{"id":5,"value":"latestvalue_5"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":{"id":9,"value":"latestvalue_9"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":2,"currenttransaction":4,"row":{"id":1,"value":"latestvalue_1"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":{"id":13,"value":"latestvalue_13"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":4,"currenttransaction":4,"row":{"id":11,"value":"latestvalue_11"}} {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":5,"currenttransaction":4,"row":{"id":4,"value":"latestvalue_4"}} Processing data file tmp/transactions/delete_delta_0000001_0000004_v0000429/bucket_00000 [length: 713] {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":null} {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":null} {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":null} {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":2,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":null} {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":null} {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":null} _____________________________________________________________________________________________________________________
MR
Processing data file tmp/transactions/delta_0000001_0000004_v0000479/bucket_00000 [length: 1002] {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,"row":{"id":1,"value":"value_01"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":1,"row":{"id":2,"value":"value_02"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":2,"currentTransaction":1,"row":{"id":3,"value":"value_03"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":1,"row":{"id":4,"value":"value_04"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":1,"row":{"id":5,"value":"value_05"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":1,"row":{"id":6,"value":"value_06"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":6,"currentTransaction":1,"row":{"id":7,"value":"value_07"}} {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":7,"currentTransaction":1,"row":{"id":8,"value":"value_08"}} {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":{"id":9,"value":"value_9"}} {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":{"id":10,"value":"value_10"}} {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":2,"row":{"id":11,"value":"value_11"}} {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":{"id":12,"value":"value_12"}} {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":2,"row":{"id":1,"value":"newvalue_1"}} {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":2,"row":{"id":2,"value":"newvalue_2"}} {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":2,"row":{"id":4,"value":"newvalue_4"}} {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":2,"row":{"id":6,"value":"newvalue_6"}} {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":13,"value":"value_13"}} {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":{"id":14,"value":"value_14"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":0,"currentTransaction":3,"row":{"id":10,"value":"newestvalue_10"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":{"id":11,"value":"newestvalue_11"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":{"id":2,"value":"newestvalue_2"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":{"id":4,"value":"newestvalue_4"}} {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":4,"currentTransaction":3,"row":{"id":6,"value":"newestvalue_6"}} {"operation":0,"originalTransaction":4,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":{"id":15,"value":"value_15"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":{"id":5,"value":"latestvalue_5"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":{"id":9,"value":"latestvalue_9"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":2,"currentTransaction":4,"row":{"id":1,"value":"latestvalue_1"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":{"id":13,"value":"latestvalue_13"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":4,"currentTransaction":4,"row":{"id":11,"value":"latestvalue_11"}} {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":5,"currentTransaction":4,"row":{"id":4,"value":"latestvalue_4"}} _____________________________________________________________________________________________________________________ Processing data file tmp/transactions/delete_delta_0000001_0000004_v0000479/bucket_00000 [length: 632] {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":null} {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":null} {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":null} {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":2,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":null} {"operation":2,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":null} {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":null} _____________________________________________________________________________________________________________________
It can be seen that when the query-based MINOR compaction was used, the sorting order of the rows is "bucket, originalTransactionId, rowId". But when the MR MINOR compaction was used, the order is "originalTransactionId, bucket, rowId".
The ordering in the query-based compaction has to be fixed to be aligned with the MR compaction's ordering.
Attachments
Issue Links
- links to