Apache Hudi / HUDI-6047

Clustering operation on consistent hashing resulting in duplicate data


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Version: 0.13.1

    Description

      Hudi chooses the committed consistent hashing bucket metadata file based on the replace commit logged on the Hudi active timeline. However, once Hudi archives the timeline, it falls back to the default consistent hashing bucket metadata file, 00000000000000.hashing_meta, which results in duplicate records being written to the table.

      This behaviour produces duplicate data in the Hudi table and causes subsequent clustering operations to fail, because the file groups on storage become inconsistent with the file groups recorded in the metadata files.


      See the loadMetadata function of the consistent hashing index implementation:

      https://github.com/apache/hudi/blob/4da64686cfbcb6471b1967091401565f58c835c7/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bucket/HoodieSparkConsistentBucketIndex.java#L190
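The failure mode described above can be sketched in simplified form. This is a hypothetical illustration, not Hudi's actual code: the class, method, and file names below (other than 00000000000000.hashing_meta) are invented for the example. It shows how a loader that only trusts metadata files whose instant is still on the active timeline silently regresses to the initial metadata once the replace commit is archived.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the reported bug: hashing metadata files are
// matched against replace commits on the ACTIVE timeline only, so an
// archived commit causes a fallback to the default metadata file.
public class HashingMetadataFallback {
    static final String DEFAULT_META = "00000000000000.hashing_meta";

    // Pick the newest hashing_meta file whose instant time is still on
    // the active timeline; otherwise fall back to the default file.
    static String loadMetadata(List<String> metaFiles, Set<String> activeInstants) {
        return metaFiles.stream()
                .filter(f -> activeInstants.contains(f.replace(".hashing_meta", "")))
                .max(Comparator.naturalOrder())
                .orElse(DEFAULT_META);
    }

    public static void main(String[] args) {
        List<String> metaFiles = List.of("20230401000000.hashing_meta");

        // While the replace commit is on the active timeline, the newer
        // metadata is chosen.
        System.out.println(loadMetadata(metaFiles, Set.of("20230401000000")));
        // -> 20230401000000.hashing_meta

        // After archival, the instant leaves the active timeline and the
        // loader falls back to the initial metadata: new writes then hash
        // records against stale buckets, producing duplicates.
        System.out.println(loadMetadata(metaFiles, Set.of()));
        // -> 00000000000000.hashing_meta
    }
}
```

The sketch suggests why the fix needs to consult archived commits (or persist the committed metadata independently of the active timeline) rather than falling back to the default file.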


      Let me know if anything else is needed.

          People

            Assignee: Unassigned
            Reporter: imrv13 (Rohan)
