[HUDI-4044] When reading data from flink-hudi and sinking it to external storage, the result is incorrect


Details

    Description

      When reading data from flink-hudi and sinking it to external storage, the result is incorrect because of a concurrency issue.
       
      Here is the case:
       
      There is a split_monitor task that listens for changes on the timeline every N seconds, and there are four split_reader tasks that process the changed data and sink it to external storage:
       
      (1) First, split_monitor detects the Instant1 change; the corresponding fileId is log1. split_monitor distributes the fileId information to split_reader task 1 in rebalance mode for processing.
       
      (2) Then, split_monitor detects the Instant2 change; the corresponding fileId is again log1 (assuming the changed data has the same primary key). split_monitor distributes the fileId information to split_reader task 2 in rebalance mode for processing.
       
      (3) split_reader task 1 and split_reader task 2 now process data with the same primary key, and their processing speeds differ. As a result, the order in which the data reaches external storage is not guaranteed: data modified earlier can overwrite data modified later, producing incorrect results (see the sketch below).
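 
      A minimal Flink DataStream sketch of this wiring, assuming a hypothetical FileSplit descriptor and a plain map() as a stand-in for the split_reader (these are not Hudi's actual stream-read classes): the rebalance() edge round-robins the two log1 splits across the reader subtasks, so nothing forces the Instant1 work to finish before the Instant2 work.
 
{code:java}
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RebalanceDistributionSketch {

    /** Hypothetical split descriptor: the fileId of the changed file group plus the instant that changed it. */
    public static class FileSplit {
        public String fileId;
        public String instant;

        public FileSplit() {
        }

        public FileSplit(String fileId, String instant) {
            this.fileId = fileId;
            this.instant = instant;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the split_monitor output: two instants that both touch file group log1.
        DataStream<FileSplit> splits = env.fromElements(
                new FileSplit("log1", "instant1"),
                new FileSplit("log1", "instant2"));

        splits
                // rebalance() distributes splits round-robin, so the log1 splits for
                // instant1 and instant2 can land on different reader subtasks and race
                // each other on the way to the external sink.
                .rebalance()
                // Stand-in for split_reader: the real operator reads the log file and
                // writes the changed records to external storage.
                .map(new MapFunction<FileSplit, String>() {
                    @Override
                    public String map(FileSplit split) {
                        return split.fileId + " read for " + split.instant;
                    }
                })
                .setParallelism(4)
                .print();

        env.execute("rebalance-distribution-sketch");
    }
}
{code}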
       
       
      Solution:
      After the split_monitor task detects data changes, it distributes them to the split_reader tasks by hashing on fileId, so that splits for the same fileId are always processed by the same split_reader task. This preserves the processing order per file and fixes the problem, as sketched below.
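 
      Continuing the hypothetical sketch above, one way to express the fileId-hash distribution in the DataStream API is to replace the rebalance() edge with a keyBy on the fileId (whether Hudi's actual operator wiring uses keyBy or a custom partitioner is not claimed here); the rest of the pipeline is unchanged.
 
{code:java}
// Replaces the .rebalance() edge in the sketch above
// (additionally requires: import org.apache.flink.api.java.functions.KeySelector;).
splits
        .keyBy(new KeySelector<FileSplit, String>() {
            @Override
            public String getKey(FileSplit split) {
                // Same fileId -> same hash key -> same split_reader subtask, so the
                // log1 split for instant1 is always processed before the one for instant2.
                return split.fileId;
            }
        })
        .map(new MapFunction<FileSplit, String>() { // unchanged split_reader stand-in
            @Override
            public String map(FileSplit split) {
                return split.fileId + " read for " + split.instant;
            }
        })
        .setParallelism(4)
        .print();
{code}
 
      With this routing, both log1 splits are processed by the same subtask in the order the split_monitor emitted them, so the write triggered by Instant1 can no longer overtake and overwrite the write triggered by Instant2.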


            People

              Assignee: Unassigned
              Reporter: aliceyyan (yanxiang)
              Votes: 0
              Watchers: 3
