Apache Hudi / HUDI-4515

Savepointed files are cleaned under the KEEP_LATEST_FILE_VERSIONS policy


Details

    • 0.5

    Description

      While testing how clean interacts with savepoints, I found that when the cleaning policy keeps the latest file versions, files that belong to a savepoint are deleted. After reading the code, I believe this is a bug.

       

      For example, with "HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS" and "hoodie.cleaner.fileversions.retained" set to 2, I do the following:
      1. insert, get xxxx_001.parquet
      2. savepoint
      3. insert, get xxxx_002.parquet
      4. insert, get xxxx_003.parquet
      After the fourth step, xxxx_001.parquet is deleted even though it belongs to the savepoint!

       

      The relevant code is the method getFilesToCleanKeepingLatestVersions in hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:

      • In the following code, a savepointed file that falls among the newest versions is skipped and is not counted toward keepVersions, which I feel is unreasonable.
      • On the other hand, a savepointed file that falls among the remaining (older) versions is deleted, which I don't think is in line with the design philosophy of savepoints.
      while (fileSliceIterator.hasNext() && keepVersions > 0) {
        // Skip this most recent version
        FileSlice nextSlice = fileSliceIterator.next();
        Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
        if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
          // do not clean up a savepoint data file
          continue;
        }
        keepVersions--;
      }
      // Delete the remaining files
      while (fileSliceIterator.hasNext()) {
        FileSlice nextSlice = fileSliceIterator.next();
        deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
      }
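
To make the failure concrete, here is a minimal sketch that replays the four-step example through the same loop structure. It is a simplified model, not Hudi code: plain file names stand in for FileSlice and HoodieBaseFile, the list is assumed to be ordered newest first, and the class and method names (CleanerBugSketch, filesToDelete) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class CleanerBugSketch {
    // Simplified stand-in for the loop quoted above (hypothetical helper, not a Hudi API).
    static List<String> filesToDelete(List<String> newestFirst, Set<String> savepointed, int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        Iterator<String> it = newestFirst.iterator();
        while (it.hasNext() && keepVersions > 0) {
            String name = it.next();
            if (savepointed.contains(name)) {
                // A savepointed file is skipped without being counted as a kept version.
                continue;
            }
            keepVersions--;
        }
        // Everything left is scheduled for deletion; there is no savepoint check here.
        while (it.hasNext()) {
            deletePaths.add(it.next());
        }
        return deletePaths;
    }

    public static void main(String[] args) {
        List<String> versions = List.of("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet");
        // Savepoint taken after the first insert, two versions retained.
        System.out.println(filesToDelete(versions, Set.of("xxxx_001.parquet"), 2));
        // prints [xxxx_001.parquet]
    }
}
```

In this model the first loop consumes xxxx_003.parquet and xxxx_002.parquet, so the savepointed xxxx_001.parquet reaches the second loop and is marked for deletion.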

       

      So I think the savepoint check should be moved down into the deletion loop. It can be fixed like this:

      while (fileSliceIterator.hasNext() && keepVersions > 0) {
        // Skip this most recent version
        fileSliceIterator.next();
        keepVersions--;
      }
      // Delete the remaining files
      while (fileSliceIterator.hasNext()) {
        FileSlice nextSlice = fileSliceIterator.next();
        Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
        if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
          // do not clean up a savepoint data file
          continue;
        }
        deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
      }
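
As a rough check of the proposed change, the sketch below runs the reordered logic on a simplified model. It is not Hudi code: plain file names stand in for FileSlice and HoodieBaseFile, the list is assumed to be ordered newest first, and the class and method names (CleanerFixSketch, filesToDelete) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class CleanerFixSketch {
    // Simplified stand-in for the proposed loop (hypothetical helper, not a Hudi API).
    static List<String> filesToDelete(List<String> newestFirst, Set<String> savepointed, int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        Iterator<String> it = newestFirst.iterator();
        // Keep the newest keepVersions files, counting savepointed ones like any other.
        while (it.hasNext() && keepVersions > 0) {
            it.next();
            keepVersions--;
        }
        // Delete the remaining files, except those referenced by a savepoint.
        while (it.hasNext()) {
            String name = it.next();
            if (savepointed.contains(name)) {
                continue;
            }
            deletePaths.add(name);
        }
        return deletePaths;
    }

    public static void main(String[] args) {
        List<String> versions = List.of("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet");
        // Savepoint on the oldest file: it is preserved, so nothing is deleted.
        System.out.println(filesToDelete(versions, Set.of("xxxx_001.parquet"), 2)); // prints []
        // Savepoint on the newest file: it counts as a kept version, so the oldest is deleted.
        System.out.println(filesToDelete(versions, Set.of("xxxx_003.parquet"), 2)); // prints [xxxx_001.parquet]
    }
}
```

The two calls correspond to the two concerns in the description: a savepointed older file is no longer deleted, and a savepointed recent file is counted toward the retained versions.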

       

      Thanks.

            People

              zouxxyy Xinyu Zou
              zouxxyy Xinyu Zou
              sivabalan narayanan