Description
When Hive reads a MOR RT table, it returns duplicate records in the following case:
- Use the bucket index type.
- Say the primary keys are 1 - 100, and set the bucket number to 1.
- Insert records 1 - 100 and compact; one parquet file is generated.
- Insert records 1 - 100 once again, but don't compact, so the data consists of 1 parquet file + 1 log file.
- Run select * from table where key=1; you get 2 records.
The cause is:
In HoodieMergeOnReadTableInputFormat, isSplitable returns true, so two map tasks are generated. Each task includes the log file, so each task returns one record.
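
To illustrate the cause described above, here is a toy model in Python. It is not Hudi code: read_split, the dict-based "files", and the 1-50 / 51-100 split boundaries are all hypothetical, and it assumes that every split is handed the whole log file and emits log records even when their key is not in its slice of the parquet file.

```python
# Toy model of the duplicate read. Not Hudi code; all names are hypothetical.

def read_split(base_rows, log_rows, key_range):
    """Simulate one map task: read a slice of the base file, then apply the log."""
    lo, hi = key_range
    # Base (parquet) rows that fall inside this split.
    merged = {k: v for k, v in base_rows.items() if lo <= k <= hi}
    # Assumption: the task receives the WHOLE log file, so every log record
    # is applied, including keys that belong to another split's base slice.
    for k, v in log_rows.items():
        merged[k] = v
    return merged

# First insert + compaction -> one parquet file with keys 1..100.
base = {k: f"v1-{k}" for k in range(1, 101)}
# Second insert without compaction -> one log file with keys 1..100.
log = {k: f"v2-{k}" for k in range(1, 101)}

# Splittable case: two map tasks over the same file group.
split_tasks = [read_split(base, log, r) for r in [(1, 50), (51, 100)]]
hits_split = [task[1] for task in split_tasks if 1 in task]

# Non-splittable case: a single task reads the whole file group.
single_task = [read_split(base, log, (1, 100))]
hits_unsplit = [task[1] for task in single_task if 1 in task]

print(len(hits_split))    # rows returned for key=1 with two tasks
print(len(hits_unsplit))  # rows returned for key=1 with one task
```

Under these assumptions, key=1 is returned once per task in the splittable case and only once when the file group is read by a single task, which matches the symptom reported above.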
Please refer to this: