Apache Hudi / HUDI-6155

Fix cleaner based on hours for earliest commit to retain

Details

    Description

      When the cleaner policy is based on hours (KEEP_LATEST_BY_HOURS), the earliest commit to retain is estimated using the system default time zone rather than UTC or the time zone that was used to generate the commit times. The computed cutoff can therefore be off by the zone offset, which can lead to deleting more file slices than intended.

       

      Ref: https://github.com/apache/hudi/blob/c6760772f8dc62eb44c45b022ed07858d895d804/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L511

       

      else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
        Instant instant = Instant.now();
        ZonedDateTime currentDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault());
        String earliestTimeToRetain = HoodieActiveTimeline.formatDate(Date.from(currentDateTime.minusHours(hoursRetained).toInstant()));
        earliestCommitToRetain = Option.fromJavaOptional(commitTimeline.getInstantsAsStream().filter(i -> HoodieTimeline.compareTimestamps(i.getTimestamp(),
                HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
      } 
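
To make the failure mode concrete, here is a minimal, self-contained sketch that does not use any Hudi classes. It assumes commit times were written in UTC using a `yyyyMMddHHmmssSSS` pattern, while the cutoff is formatted in a different local zone (Asia/Kolkata is picked only for illustration). The class and method names (`ZoneMismatchDemo`, `formatIn`, `isRetained`) are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

final class ZoneMismatchDemo {
    // Assumed commit-time pattern; only formatting is done here, no parsing.
    private static final DateTimeFormatter INSTANT_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Render an instant as a commit-time string in the given zone.
    static String formatIn(Instant instant, ZoneId zone) {
        return ZonedDateTime.ofInstant(instant, zone).format(INSTANT_FORMAT);
    }

    // Commit times are compared lexicographically, like the timeline strings.
    static boolean isRetained(String commitTime, String cutoff) {
        return commitTime.compareTo(cutoff) >= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-05-01T12:00:00Z");
        int hoursRetained = 24;

        // A commit made 22 hours ago, with its time written in UTC.
        String commitTime = formatIn(now.minus(Duration.ofHours(22)), ZoneId.of("UTC"));

        Instant cutoffInstant = now.minus(Duration.ofHours(hoursRetained));
        String cutoffUtc = formatIn(cutoffInstant, ZoneId.of("UTC"));
        String cutoffLocal = formatIn(cutoffInstant, ZoneId.of("Asia/Kolkata"));

        // Same zone on both sides: the 22-hour-old commit is inside the window.
        System.out.println("retained with UTC cutoff:   " + isRetained(commitTime, cutoffUtc));
        // Mismatched zones: the cutoff string is 5h30m later, so the commit falls outside.
        System.out.println("retained with local cutoff: " + isRetained(commitTime, cutoffLocal));
    }
}
```

In this scenario the commit is only 22 hours old, yet the cutoff formatted in the local zone sorts after it, so the commit would be treated as older than the 24-hour retention window.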

       

       

      Potential fixes:

      • Compute the cutoff time in the time zone set in the table config.
      • Fetch the latest completed commit and derive the earliest commit to retain from its timestamp.
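
Both options can be sketched in plain `java.time`, without Hudi classes. This is an illustration under stated assumptions, not the actual patch: the `yyyyMMddHHmmssSSS` commit-time pattern is assumed, and the names `CleanerCutoffSketch`, `cutoffInZone`, and `cutoffFromLatestCommit` are hypothetical. The zone passed to `cutoffInZone` stands in for whatever time zone the table config specifies.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

final class CleanerCutoffSketch {
    // Assumed commit-time layout: yyyyMMddHHmmss followed by 3 millisecond digits.
    // Built with a builder so that parsing also works on Java 8.
    private static final DateTimeFormatter INSTANT_FORMAT = new DateTimeFormatterBuilder()
            .appendPattern("yyyyMMddHHmmss")
            .appendValue(ChronoField.MILLI_OF_SECOND, 3)
            .toFormatter();

    // Option 1: format the cutoff in the same zone the commit times were written in.
    static String cutoffInZone(Instant now, int hoursRetained, ZoneId timelineZone) {
        return ZonedDateTime.ofInstant(now, timelineZone)
                .minusHours(hoursRetained)
                .format(INSTANT_FORMAT);
    }

    // Option 2: ignore the wall clock and subtract the retention window from the
    // latest completed commit time, which is already in the timeline's own zone.
    static String cutoffFromLatestCommit(String latestCompletedCommitTime, int hoursRetained) {
        return LocalDateTime.parse(latestCompletedCommitTime, INSTANT_FORMAT)
                .minusHours(hoursRetained)
                .format(INSTANT_FORMAT);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-05-01T12:00:00Z");
        System.out.println(cutoffInZone(now, 24, ZoneOffset.UTC));
        System.out.println(cutoffFromLatestCommit("20230501113000000", 24));
    }
}
```

The second option measures the window from the last completed commit rather than from the current time, so the effective retention period shifts slightly if the table has been idle. Because it works on a local date-time with no zone rules, it also does not account for daylight-saving transitions.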

            People

              Assignee: Unassigned
              Reporter: sivabalan narayanan (shivnarayan)