[HUDI-1082] Bug in deciding the upsert/insert buckets - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.6.0
Fix Version/s: 0.6.0
Component/s: writer-core
Labels:
- pull-request-available

Description

In [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java], when getPartition(Object key) is called, the logic to determine where the record to be inserted is relying on globalInsertCounts where as this should be perPartitionInsertCount.

Bcoz, the weights for all targetInsert buckets are determined based on total Inserts going into the partition of interest. // check like 200. Whereas when getPartition(key) is called, we use global insert count to determine the right bucket.

For instance,

P1: 3 insert buckets with weights 0.2, 0.5 and 0.3 with total records to be inserted is 100.

P2: 4 bucket with weights 0.1, 0.8, 0.05, 0.05 with total records to be inserted is 10025.

So, ideal allocation is

P1: B0 -> 20, B1 -> 50, B2 -> 30

P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503

getPartition() for a key is determined based on following.

mod (hash value, inserts)/ inserts.

Instead of considering inserts for the partition of interest, currently we take global insert counts.

Lets say, these are the hash values for insert records in P1.

5, 14, 20, 25, 90, 500, 1001, 5180.

record hash | expected bucket in P1 | actual bucket in P1 |

5 | B0 | B0

14 | B0 | B0

21 | B1 | B0

30 | B1 | B0

90 | B2 | B0

500 | B0 | B0

1490 | B2 | B1

10019 | B0 | B3

Attachments

Issue Links

links to

GitHub Pull Request #1838

Activity

People

Assignee:: shenh062326

Reporter:: sivabalan narayanan

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 08/Jul/20 14:41

Updated:: 26/Jul/20 02:30

Resolved:: 26/Jul/20 02:30