  1. Apache Hudi
  2. HUDI-1292 [Umbrella] RFC-15 : File Listing and Query Planning Optimizations
  3. HUDI-2476

Fix: retried compaction commit in data table fails when applied to metadata table w/ sync updates


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: Writer Core
    • Labels: None

      Description

      Compaction and clustering have a static instant time. So, when one of them is retried, it may not get a new instant time but reuse the same one.

      Let's walk through what happens when a compaction fails after it has been synced to the metadata table.

      c1, c2, cc3 (compaction commit), c4.

      c1, c2, and c4 are applied to the metadata table on completion.

      cc3 also gets synced to the metadata table, but the process crashes before cc3 is committed to the data table. It is a small window, but still a possibility.

      So, from a timeline standpoint, this is what it looks like:

      data timeline: 

      c1 complete, c2 complete, cc3 inflight, c4 complete.

      metadata timeline:

      dc_c1, dc_c2, dc_cc3, dc_c4

      Let's say a few more commits went in.

      data timeline: 

      c1 complete, c2 complete, cc3 inflight, c4 complete, c5 complete, c6 complete.

      metadata timeline:

      dc_c1, dc_c2, dc_cc3, dc_c4, dc_c5, dc_c6
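
      To make the mismatch concrete, here is a minimal Python sketch of the two timelines above. It is only an illustration, not Hudi code: the `State` enum, the dictionary layout, and the helper `dangling_delta_commits` are hypothetical names, and the `dc_` prefix is taken from the notation used in this description.

```python
from enum import Enum


class State(Enum):
    INFLIGHT = "inflight"
    COMPLETE = "complete"


# Data table timeline from the scenario: cc3 never completed.
data_timeline = {
    "c1": State.COMPLETE,
    "c2": State.COMPLETE,
    "cc3": State.INFLIGHT,
    "c4": State.COMPLETE,
    "c5": State.COMPLETE,
    "c6": State.COMPLETE,
}

# Metadata table timeline: one completed delta commit per synced instant.
metadata_timeline = ["dc_c1", "dc_c2", "dc_cc3", "dc_c4", "dc_c5", "dc_c6"]


def dangling_delta_commits(data, metadata):
    """Return delta commits whose data-table instant is not complete."""
    prefix = "dc_"
    return [
        dc for dc in metadata
        if data.get(dc[len(prefix):]) != State.COMPLETE
    ]


dangling = dangling_delta_commits(data_timeline, metadata_timeline)
```

      Here `dangling` contains only `dc_cc3`, the delta commit that the metadata table considers complete while the data table still shows cc3 as inflight.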

       

      Now the compaction in the data table is re-attempted. First, we roll back the pending compaction in the data table, which triggers an upsert to the metadata table. Even though this is a rollback, every update to the metadata table is an upsert, which results in a delta commit.

      Then the compaction is retried in the data table. When it nears completion, we try to upsert to the metadata table, which fails because the metadata table already has a completed dc_cc3.
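
      The failure can be reproduced with a toy model. The sketch below assumes the metadata timeline rejects any upsert whose instant time is already completed. The class `MetadataTimeline`, the exception `InstantAlreadyCompleted`, and the rollback instant time `rb7` are all made up for illustration and are not Hudi APIs.

```python
class InstantAlreadyCompleted(Exception):
    """Raised when a delta commit with this instant time is already complete."""


class MetadataTimeline:
    """Toy metadata timeline: at most one completed delta commit per instant."""

    def __init__(self):
        self.completed = []

    def upsert(self, instant_time):
        # Every sync (commit or rollback) is an upsert that must create
        # a new completed delta commit for its instant time.
        if instant_time in self.completed:
            raise InstantAlreadyCompleted(instant_time)
        self.completed.append(instant_time)


metadata = MetadataTimeline()
for instant in ["c1", "c2", "cc3", "c4", "c5", "c6"]:
    metadata.upsert(instant)  # cc3 was synced before the crash

# Rolling back the pending compaction gets its own (hypothetical) instant.
metadata.upsert("rb7")

# The retried compaction reuses the static instant time cc3.
try:
    metadata.upsert("cc3")
    retry_outcome = "applied"
except InstantAlreadyCompleted:
    retry_outcome = "rejected"
```

      The rollback goes through because it carries a new instant time, while the retried compaction is rejected because it reuses cc3.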

       

      Fix: 

      When a commit is being retried, we delete the completed instant in the metadata table and then proceed with the upsert. When the log blocks/files are merged, the final state stays intact: only the files added in the second attempt are returned, and those added during the first attempt are not (since there is a complementing log block corresponding to the rollback).
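
      A rough sketch of this fix, under simplifying assumptions: log blocks are modeled as ordered `(kind, files)` pairs, a rollback is modeled as a `delete` block, and any upsert for an already-completed instant is treated as a retry. `MetadataTable`, `merge_log_blocks`, the file names, and the instant `rb7` are hypothetical and only mirror the description above. They are not the actual implementation.

```python
def merge_log_blocks(blocks):
    """Replay (kind, files) blocks in order and return the surviving files."""
    files = set()
    for kind, names in blocks:
        if kind == "add":
            files.update(names)
        elif kind == "delete":
            files.difference_update(names)
    return files


class MetadataTable:
    """Toy metadata table that tolerates a retried instant time."""

    def __init__(self):
        self.completed = set()
        self.log_blocks = []

    def upsert(self, instant_time, kind, names):
        # The fix: drop the completed instant left by the failed attempt.
        # Its log blocks stay; the rollback block cancels them on merge.
        self.completed.discard(instant_time)
        self.log_blocks.append((kind, list(names)))
        self.completed.add(instant_time)

    def list_files(self):
        return merge_log_blocks(self.log_blocks)


table = MetadataTable()
table.upsert("c1", "add", ["f1.log"])
table.upsert("cc3", "add", ["f1_attempt1.parquet"])     # first attempt, synced
table.upsert("rb7", "delete", ["f1_attempt1.parquet"])  # rollback of cc3
table.upsert("cc3", "add", ["f1_attempt2.parquet"])     # retry, same instant

visible_files = table.list_files()
```

      After the merge, `visible_files` holds `f1.log` and `f1_attempt2.parquet`. The first attempt's file is gone because the rollback block cancels it, and the retry succeeds because the stale completed instant was removed first.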

       

       

       


            People

            • Assignee: shivnarayan sivabalan narayanan
            • Reporter: shivnarayan sivabalan narayanan
