Apache Hudi / HUDI-7226

Clean by hour does not respect lastVersionBeforeEarliestCommitToRetain


Details

    Description

      org.apache.hudi.table.action.clean.CleanPlanner#getFilesToCleanKeepingLatestCommits(java.lang.String, int, org.apache.hudi.common.model.HoodieCleaningPolicy)

      The KEEP_LATEST_BY_HOURS policy does not honor lastVersionBeforeEarliestCommitToRetain. As a result, the cleaner removes a file slice as soon as it becomes non-latest. This can fail long-running queries through a race condition:

      1. The timeline contains t0.deltacommit (not cleaned because it is the latest).
      2. A snapshot query starts and keeps running.
      3. Compaction runs and creates t1.commit.
      4. The cleaner runs and removes t0 (because t1.commit is now the latest).
      5. The query fails because a log file belonging to t0.deltacommit is not found.
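
The race above can be illustrated with a minimal, self-contained sketch. This is not Hudi's CleanPlanner code: the class `CleanSketch`, the method `slicesToClean`, and the flag `honorLastVersionBefore` are hypothetical names used only for illustration, and a file group is modeled as a plain list of instant times sorted newest first. It assumes the cleaner's retention boundary (the earliest commit to retain) has moved to t1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of per-file-group cleaning; NOT the actual Hudi implementation.
public class CleanSketch {

    /**
     * Returns the instant times of the slices a cleaner would delete.
     *
     * @param newestFirst            slice instant times in one file group, newest first
     * @param earliestCommitToRetain slices at or after this instant are always kept
     * @param honorLastVersionBefore if true, also keep the newest slice that is older
     *                               than earliestCommitToRetain (hypothetical flag)
     */
    public static List<String> slicesToClean(List<String> newestFirst,
                                             String earliestCommitToRetain,
                                             boolean honorLastVersionBefore) {
        List<String> toClean = new ArrayList<>();
        boolean seenOlder = false;
        for (int i = 0; i < newestFirst.size(); i++) {
            String instant = newestFirst.get(i);
            boolean isLatest = (i == 0);
            boolean older = instant.compareTo(earliestCommitToRetain) < 0;
            // The first older slice we meet is the last version before the boundary.
            boolean isLastVersionBefore = older && !seenOlder;
            if (older) {
                seenOlder = true;
            }
            if (isLatest || !older) {
                continue; // latest slice and slices inside the retention window stay
            }
            if (isLastVersionBefore && honorLastVersionBefore) {
                continue; // a running query may still be reading this slice
            }
            toClean.add(instant);
        }
        return toClean;
    }

    public static void main(String[] args) {
        // After compaction, t1 is the latest slice and t0 is the previous one.
        List<String> slices = Arrays.asList("t1", "t0");
        // Behavior described in this issue: t0 is removed once it is non-latest.
        System.out.println(slicesToClean(slices, "t1", false)); // prints [t0]
        // Keeping the last version before the boundary leaves t0 in place.
        System.out.println(slicesToClean(slices, "t1", true));  // prints []
    }
}
```

With the flag off, t0 is deleted while the snapshot query from step 2 may still be reading its log files; with the flag on, t0 survives until a later clean moves the boundary past the next slice.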

          People

            tim.brown Timothy Brown
            xushiyan Shiyan Xu