Apache Hudi / HUDI-7332

The best way to force cleaning hoodie metadata


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      We have a Spark Structured Streaming job writing data to Hudi tables. After upgrading to Hudi 0.11, we found thousands of files under the .hoodie metadata directory that were never cleaned or archived, which degrades the overall performance of the streaming job. I found a similar report in https://github.com/apache/hudi/issues/7472, where it was mentioned that this was fixed in 0.13. Since the issue is in our production environment, we cannot upgrade to 0.13 for now. I found that I can run a separate spark-submit job to execute HoodieCleaner. I also found that deleting the Hudi metadata from hudi-cli could be an option, but I am not sure whether that approach is safe given that our streaming job uses the Hudi upsert operation.
      Please advise on the best way to force cleaning and archiving of the metadata files.
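
      For reference, the standalone cleaner mentioned above can be launched roughly as follows. This is a hedged sketch, not a confirmed resolution: the table path, bundle version, and the specific retention values are placeholders to adapt, and the cleaner/archival properties shown are standard Hudi write configs rather than settings recommended for this ticket.

      ```shell
      # Sketch: run the standalone HoodieCleaner against the affected table.
      # Paths, versions, and retention values below are illustrative only.
      spark-submit \
        --class org.apache.hudi.utilities.HoodieCleaner \
        hudi-utilities-bundle_2.12-0.11.1.jar \
        --target-base-path s3://my-bucket/path/to/hudi_table \
        --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
        --hoodie-conf hoodie.cleaner.commits.retained=10 \
        --hoodie-conf hoodie.keep.min.commits=20 \
        --hoodie-conf hoodie.keep.max.commits=30
      ```

      Note that cleaning and archival are distinct: the cleaner removes old file slices from data partitions, while timeline archival (governed by hoodie.keep.min.commits / hoodie.keep.max.commits) is what trims the instant files under .hoodie.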


          People

            Assignee: Unassigned
            Reporter: Haitham Eltaweel
            Votes: 0
            Watchers: 1
