Apache Hudi / HUDI-6047

Clustering operation on consistent hashing resulting in duplicate data


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Version: 0.13.1

    Description

      Hudi chooses the committed consistent hashing bucket metadata file based on the replace commit logged on the Hudi active timeline. However, once Hudi archives the timeline, it falls back to the default consistent hashing bucket metadata file, 00000000000000.hashing_meta, which results in duplicate records being written to the table.

      This behaviour produces duplicate data in the Hudi table and causes subsequent clustering operations to fail, because the file groups on storage become inconsistent with the file groups recorded in the metadata files.


      See the loadMetadata function of the consistent hashing index implementation:

      https://github.com/apache/hudi/blob/4da64686cfbcb6471b1967091401565f58c835c7/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java#L190
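The failure mode described above can be sketched in simplified form. This is a hypothetical illustration, not Hudi's actual code: the class, method, and file names below (other than 00000000000000.hashing_meta) are invented for the example. It shows how a loader that only trusts metadata files whose instant is still on the active timeline silently regresses to the initial metadata once the replace commit is archived.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the reported bug: hashing metadata files are
// matched against replace commits on the ACTIVE timeline only, so an
// archived commit causes a fallback to the default metadata file.
public class HashingMetadataFallback {
    static final String DEFAULT_META = "00000000000000.hashing_meta";

    // Pick the newest hashing_meta file whose instant time is still on
    // the active timeline; otherwise fall back to the default file.
    static String loadMetadata(List<String> metaFiles, Set<String> activeInstants) {
        return metaFiles.stream()
                .filter(f -> activeInstants.contains(f.replace(".hashing_meta", "")))
                .max(Comparator.naturalOrder())
                .orElse(DEFAULT_META);
    }

    public static void main(String[] args) {
        List<String> metaFiles = List.of("20230401000000.hashing_meta");

        // While the replace commit is on the active timeline, the newer
        // metadata is chosen.
        System.out.println(loadMetadata(metaFiles, Set.of("20230401000000")));
        // -> 20230401000000.hashing_meta

        // After archival, the instant leaves the active timeline and the
        // loader falls back to the initial metadata: new writes then hash
        // records against stale buckets, producing duplicates.
        System.out.println(loadMetadata(metaFiles, Set.of()));
        // -> 00000000000000.hashing_meta
    }
}
```

The sketch suggests why the fix needs to consult archived commits (or persist the committed metadata independently of the active timeline) rather than falling back to the default file.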


      Let me know if anything else is needed.

          People

            Assignee: Unassigned
            Reporter: imrv13 (Rohan)
