Details
Description
We have a Spark Structured Streaming job writing data to Hudi tables. After upgrading to Hudi 0.11, we found thousands of files under the Hudi metadata directory that were never cleaned or archived, which degrades the overall processing of the streaming job. I found a similar issue in https://github.com/apache/hudi/issues/7472, where it was mentioned that this was fixed in 0.13. Since we hit this in Prod, we cannot upgrade to 0.13 for now. I found that I can run a separate spark-submit job to execute HoodieCleaner. I also found that deleting the Hudi metadata from hudi-cli could be an option, but I am not sure whether that approach is safe, since the streaming job uses the Hudi upsert operation.
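For reference, the Apache Hudi docs describe running the cleaner as a standalone spark-submit job via the `HoodieCleaner` class in the hudi-utilities bundle. The sketch below is what I have been considering; the jar path, table base path, and retention values are placeholders for our setup, not tested values.

```shell
# Run HoodieCleaner against the table as a one-off job.
# Placeholders: jar path, base path, and retention count must be
# adjusted to the actual environment before running.
spark-submit \
  --class org.apache.hudi.utilities.HoodieCleaner \
  /path/to/hudi-utilities-bundle_2.12-0.11.1.jar \
  --target-base-path s3://my-bucket/path/to/hudi-table \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.cleaner.commits.retained=10
```

I am unsure whether running this against the table base path also triggers cleaning of the metadata table's files, which is the part that is growing for us.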
Please advise on the best way to force cleaning and archiving of the metadata files.