Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
0.6.0
Description
In [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java], when getPartition(Object key) is called, the logic to determine where the record to be inserted is relying on globalInsertCounts where as this should be perPartitionInsertCount.
Bcoz, the weights for all targetInsert buckets are determined based on total Inserts going into the partition of interest. // check like 200. Whereas when getPartition(key) is called, we use global insert count to determine the right bucket.
For instance,
P1: 3 insert buckets with weights 0.2, 0.5 and 0.3 with total records to be inserted is 100.
P2: 4 bucket with weights 0.1, 0.8, 0.05, 0.05 with total records to be inserted is 10025.
So, ideal allocation is
P1: B0 -> 20, B1 -> 50, B2 -> 30
P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503
getPartition() for a key is determined based on following.
mod (hash value, inserts)/ inserts.
Instead of considering inserts for the partition of interest, currently we take global insert counts.
Lets say, these are the hash values for insert records in P1.
5, 14, 20, 25, 90, 500, 1001, 5180.
record hash | expected bucket in P1 | actual bucket in P1 |
5 | B0 | B0
14 | B0 | B0
21 | B1 | B0
30 | B1 | B0
90 | B2 | B0
500 | B0 | B0
1490 | B2 | B1
10019 | B0 | B3
Attachments
Issue Links
- links to