Apache Hudi / HUDI-1347

HBase index: partition changes cause duplicate data

Details

    Description

      1. A record changes partition more than once in the same batch. After the deduplication step, the partition in the HoodieRecord key no longer matches the partition in the record data.

      For example:

      id,oid,name,dt,isdeleted,lastupdatedttm,rowkey
      9,1,aaaa,2018,0,2020-02-17 00:50:25.000001,00_test1-9-1
      9,1,aaaa,2019,0,2020-02-17 00:50:25.000002,00_test1-9-1

      rowkey is the primary key and dt is the partition field. After deduplication, the key of the HoodieRecord object is (00_test1-9-1, 2018). It should be (00_test1-9-1, 2019), because the row with dt = 2019 has the later lastupdatedttm.
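      The expected deduplication behavior can be sketched as follows. This is a minimal illustration, not Hudi's implementation: `Key`, `Record`, `make_record`, and `deduplicate` are hypothetical stand-ins, and the choice of lastupdatedttm as the ordering field is an assumption based on the example rows. The point it shows is that the winning record is kept whole, key included, so the partition in the key cannot drift from the partition in the data.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Hudi's key and record classes (not the real API).
@dataclass(frozen=True)
class Key:
    record_key: str
    partition_path: str


@dataclass(frozen=True)
class Record:
    key: Key
    ordering_value: str  # assumed ordering field: lastupdatedttm
    data: dict


def make_record(row):
    """Build a record whose key partition comes from the row's own dt value."""
    return Record(Key(row["rowkey"], row["dt"]), row["lastupdatedttm"], row)


def deduplicate(records):
    """Keep one record per record_key: the one with the largest ordering value.

    The whole winning record is kept, so its key and its data always agree
    on the partition. Timestamps here are fixed-width strings, so plain
    string comparison orders them correctly.
    """
    latest = {}
    for rec in records:
        current = latest.get(rec.key.record_key)
        if current is None or rec.ordering_value > current.ordering_value:
            latest[rec.key.record_key] = rec
    return list(latest.values())


rows = [
    {"id": "9", "oid": "1", "name": "aaaa", "dt": "2018", "isdeleted": "0",
     "lastupdatedttm": "2020-02-17 00:50:25.000001", "rowkey": "00_test1-9-1"},
    {"id": "9", "oid": "1", "name": "aaaa", "dt": "2019", "isdeleted": "0",
     "lastupdatedttm": "2020-02-17 00:50:25.000002", "rowkey": "00_test1-9-1"},
]

deduped = deduplicate([make_record(r) for r in rows])
print(deduped[0].key)
```

      With the two example rows, the single surviving record carries the key (00_test1-9-1, 2019) and data with dt = 2019.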

      2. If an exception in the Hudi task occurs after the HBase index has been written successfully, the task fails but the index update remains. When the task is retried, a record whose partition changed is treated as a plain insert into the new partition, and the copy in the old partition is not deleted.

      Solution:

      1. Fix the incorrect partition information in the HoodieRecord key caused by the deduplication operation.

      2. Add a rollback operation to the HBase index instead of doing nothing. For a partition change, the index entry needs to be rolled back to its state as of the last successful commit.

      3. Add more test cases.
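
      The rollback idea in item 2 can be sketched with an in-memory model. This is only an illustration under stated assumptions: `VersionedIndex` and its methods are hypothetical, the commit times are made up, and nothing here uses the HBase client or Hudi's index classes. It assumes the index can tell which commit wrote each entry, so that entries from a failed commit can be dropped.

```python
class VersionedIndex:
    """Hypothetical in-memory stand-in for an index that tracks the writing commit."""

    def __init__(self):
        # record_key -> list of (commit_time, partition_path), oldest first
        self._versions = {}

    def update(self, record_key, partition_path, commit_time):
        """Record that commit_time placed record_key in partition_path."""
        self._versions.setdefault(record_key, []).append(
            (commit_time, partition_path)
        )

    def lookup(self, record_key):
        """Return the most recently written partition, or None if unknown."""
        versions = self._versions.get(record_key)
        return versions[-1][1] if versions else None

    def rollback(self, commit_time):
        """Drop every entry written by commit_time, exposing the prior state."""
        for record_key in list(self._versions):
            kept = [v for v in self._versions[record_key] if v[0] != commit_time]
            if kept:
                self._versions[record_key] = kept
            else:
                # The key was first seen in the failed commit: forget it entirely.
                del self._versions[record_key]


index = VersionedIndex()
index.update("00_test1-9-1", "2018", commit_time="001")  # successful commit
index.update("00_test1-9-1", "2019", commit_time="002")  # index written, task then fails
index.rollback("002")                                    # undo the failed commit
restored = index.lookup("00_test1-9-1")
print(restored)
```

      After the rollback, the lookup returns 2018 again, so a retried task sees the record as a partition change (delete from 2018, write to 2019) rather than as a new insert.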

       

       

            People

              hj324545 jing