  1. Apache Hudi
  2. HUDI-8248

Fix LogRecord reader to account for rollback blocks with higher timestamps


Details


    Description

      With LogRecordReader, we also configure a maxInstant time up to which log blocks are read. Rollback blocks can sometimes carry a higher timestamp than the configured maxInstant, which can lead to data inconsistencies.

       

      Let's go through an illustration:

      Say we have t1.dc and t2.dc, and t2.dc crashed midway.
      The current layout is:
      base file (t1), lf1 (partially committed data with t2 as instant time)
       
      Then, say, we start t5.dc. Just as t5.dc starts, Hudi detects the pending commit and triggers a rollback, which gets an instant time of t6 (t6.rb). Note that the rollback's commit time is greater than t5, the ongoing delta commit.
      So, once the rollback completes, this is the layout:
      base file, lf1 (from t2.dc, partially failed), lf3 (rollback command block with t6)
       
      And once t5.dc completes, the layout looks like this:
      base file, lf1 (from t2.dc, partially failed), lf3 (rollback command block with t6), lf4 (from t5)
       
      At this point, when we trigger a snapshot read or tagLocation with a global index, maxInstant is set to the last entry in the commits timeline, which is t5. While LogRecordReader processes the log blocks in order, it reaches lf3, detects that its timestamp t6 is greater than t5 (the max instant time), and bails out of the for loop. So, in essence, it may not even read lf4 in the above scenario.
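
The scenario above can be reproduced with a small, self-contained simulation. This is only a sketch under stated assumptions: `LogScanSketch`, `Block`, `BlockType`, and both scan methods are hypothetical names invented for illustration and are not Hudi classes or APIs. The sketch also ignores any other filtering a real reader may apply. The second method shows one possible way to account for rollback blocks (exempting them from the maxInstant cutoff); it is not necessarily the fix adopted for this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the log layout in the illustration; not Hudi code.
class LogScanSketch {
    enum BlockType { DATA, ROLLBACK_COMMAND }

    static final class Block {
        final String name;
        final int instant;       // instant time of the block, e.g. 6 for t6
        final BlockType type;
        final int targetInstant; // instant being rolled back (rollback blocks only)

        Block(String name, int instant, BlockType type, int targetInstant) {
            this.name = name;
            this.instant = instant;
            this.type = type;
            this.targetInstant = targetInstant;
        }
    }

    // Layout from the description: lf1 (t2, partial), lf3 (rollback at t6), lf4 (t5).
    static List<Block> layout() {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("lf1", 2, BlockType.DATA, -1));
        blocks.add(new Block("lf3", 6, BlockType.ROLLBACK_COMMAND, 2));
        blocks.add(new Block("lf4", 5, BlockType.DATA, -1));
        return blocks;
    }

    // Behaviour described in the issue: stop at the first block newer than maxInstant.
    static List<String> scanStoppingAtMaxInstant(List<Block> blocks, int maxInstant) {
        List<Block> valid = new ArrayList<>();
        for (Block b : blocks) {
            if (b.instant > maxInstant) {
                break; // lf3 (t6 > t5) ends the scan, so lf4 is never reached
            }
            apply(valid, b);
        }
        return names(valid);
    }

    // Assumed alternative: rollback command blocks are exempt from the cutoff.
    static List<String> scanExemptingRollbacks(List<Block> blocks, int maxInstant) {
        List<Block> valid = new ArrayList<>();
        for (Block b : blocks) {
            if (b.type != BlockType.ROLLBACK_COMMAND && b.instant > maxInstant) {
                break;
            }
            apply(valid, b);
        }
        return names(valid);
    }

    // A rollback block discards earlier blocks written by its target instant.
    private static void apply(List<Block> valid, Block b) {
        if (b.type == BlockType.ROLLBACK_COMMAND) {
            valid.removeIf(v -> v.instant == b.targetInstant);
        } else {
            valid.add(b);
        }
    }

    private static List<String> names(List<Block> blocks) {
        List<String> out = new ArrayList<>();
        for (Block b : blocks) {
            out.add(b.name);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scanStoppingAtMaxInstant(layout(), 5)); // prints [lf1]
        System.out.println(scanExemptingRollbacks(layout(), 5));   // prints [lf4]
    }
}
```

With maxInstant = t5, the first scan ends at lf3 and returns only the partially written lf1, while the second applies the rollback to lf1 and goes on to read lf4.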

       

      If lf1 and lf4 contain delete blocks, this could lead to data consistency issues with global indexes when a record moves from one partition to another.

       

            People

              shivnarayan sivabalan narayanan
              shivnarayan sivabalan narayanan
              Danny Chen, Y Ethan Guo
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: 24h
                  Remaining: 24h
                  Logged: Not Specified