  1. Apache Hudi
  2. HUDI-6352

KEEP_LATEST_BY_HOURS should consider modified time instead of commit time while setting earliestCommitToRetain value

Details

    Description

      In CleanPlanner, the KEEP_LATEST_BY_HOURS policy sets earliestCommitToRetain by comparing instant timestamps directly. This introduces a bug when commits complete out of order, that is, when a commit with a lower timestamp completes much later than commits with higher timestamps.
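      The issue does not spell out the exact failure, so the following is only a simplified Python model of the scenario, not CleanPlanner's actual code. The Instant class, the earliest_to_retain helper, and the sample timeline are all hypothetical. It shows how choosing the retention boundary by instant timestamp can place a recently completed commit before the boundary, while choosing by completion time does not.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Instant:
    """Hypothetical stand-in for a commit on the timeline."""
    name: str
    requested: datetime  # instant timestamp (when the commit started)
    completed: datetime  # when the commit actually finished


def earliest_to_retain(
    instants: List[Instant],
    now: datetime,
    hours: int,
    time_of: Callable[[Instant], datetime],
) -> Optional[Instant]:
    """Return the oldest completed instant whose time_of() falls inside the
    retention window [now - hours, now], or None if there is none."""
    cutoff = now - timedelta(hours=hours)
    in_window = [
        i for i in instants
        if i.completed <= now and time_of(i) >= cutoff
    ]
    return min(in_window, key=lambda i: i.requested, default=None)


# Assumed sample timeline: c1 starts first but completes last.
timeline = [
    Instant("c1", datetime(2023, 6, 1, 8, 0), datetime(2023, 6, 1, 11, 30)),
    Instant("c2", datetime(2023, 6, 1, 9, 0), datetime(2023, 6, 1, 9, 5)),
    Instant("c3", datetime(2023, 6, 1, 11, 0), datetime(2023, 6, 1, 11, 5)),
]
now = datetime(2023, 6, 1, 12, 0)

# Boundary chosen by instant timestamp: c1 (finished 30 minutes ago) is excluded.
by_timestamp = earliest_to_retain(timeline, now, 2, lambda i: i.requested)
# Boundary chosen by completion time: c1 is inside the window.
by_completion = earliest_to_retain(timeline, now, 2, lambda i: i.completed)
```

      In this model, with a 2-hour retention window at 12:00, the timestamp-based boundary is c3, so c1 falls before it even though c1 completed at 11:30. The completion-time-based boundary is c1.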

      This policy's implementation needs to be revisited.

      The cleaner should store the timestamp up to which it has already cleaned; call this t1. The next cleaner instant should then consider all partitions and files modified between t1 and (current time - x hours), and remove whichever files are no longer valid.
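      One possible reading of this proposal is sketched below in Python. It is an illustration under stated assumptions, not a design from the issue: FileVersion and plan_clean are hypothetical names, and "not valid" is assumed to mean a file version that was superseded by a newer version of the same file group at or before the cutoff. Only partitions modified in the window (t1, now - x hours] are inspected, and the cutoff is returned as the next t1.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileVersion:
    """Hypothetical record of one version of a file group."""
    partition: str
    file_group: str
    modified: datetime  # modification time, not the commit timestamp


def plan_clean(
    versions: List[FileVersion],
    last_cleaned_until: datetime,  # t1 stored by the previous clean
    now: datetime,
    hours: int,
) -> Tuple[List[FileVersion], datetime]:
    """Return (versions to delete, new t1) under the assumptions above."""
    cutoff = now - timedelta(hours=hours)

    # Partitions with at least one file modified in (t1, cutoff].
    touched = {
        v.partition for v in versions
        if last_cleaned_until < v.modified <= cutoff
    }

    # Group the versions of each file group inside the touched partitions.
    groups: Dict[Tuple[str, str], List[FileVersion]] = {}
    for v in versions:
        if v.partition in touched:
            groups.setdefault((v.partition, v.file_group), []).append(v)

    to_delete: List[FileVersion] = []
    for group in groups.values():
        at_or_before = sorted(
            (v for v in group if v.modified <= cutoff),
            key=lambda v: v.modified,
        )
        # Keep the newest version at or before the cutoff (still needed to
        # serve reads at the cutoff) and everything newer; drop the rest.
        to_delete.extend(at_or_before[:-1])

    return to_delete, cutoff


# Assumed sample data: p1 was modified inside the window, p2 was not.
sample = [
    FileVersion("p1", "fg1", datetime(2023, 6, 1, 6, 0)),
    FileVersion("p1", "fg1", datetime(2023, 6, 1, 9, 0)),
    FileVersion("p1", "fg1", datetime(2023, 6, 1, 11, 30)),
    FileVersion("p2", "fg2", datetime(2023, 6, 1, 5, 0)),
]
deleted, new_t1 = plan_clean(
    sample,
    last_cleaned_until=datetime(2023, 6, 1, 8, 0),
    now=datetime(2023, 6, 1, 12, 0),
    hours=2,
)
```

      With t1 at 08:00 and a 2-hour retention at 12:00, only p1 is inspected. Its 06:00 version is superseded by the 09:00 version and is selected for deletion, while the 09:00 and 11:30 versions are kept. Because selection is driven by modification time, a commit that started early but finished late would still be picked up by a later clean.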

            People

              Assignee: Unassigned
              Reporter: Surya Prasanna Yalla (suryaprasanna)
