Apache Hudi / HUDI-7332

The best way to force cleaning hoodie metadata


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      We have a Spark Structured Streaming job writing data to Hudi tables. After upgrading to Hudi 0.11, we found thousands of files under the .hoodie metadata directory that were never cleaned or archived, which degrades the overall performance of the streaming job. I found a similar report in https://github.com/apache/hudi/issues/7472, where it was mentioned that this was fixed in 0.13. Since the issue is in our production environment, we cannot upgrade to 0.13 for now. I found that I can run a separate spark-submit job to execute HoodieCleaner. I also found that deleting the Hudi metadata from hudi-cli could be an option, but I am not sure whether that approach is safe given that our streaming job uses the Hudi upsert operation.
      Please advise on the best way to force cleaning and archiving of the metadata files.
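
      For reference, the standalone cleaner mentioned above can be launched roughly as follows. This is a hedged sketch, not a confirmed resolution: the table path, bundle version, and the specific retention values are placeholders to adapt, and the cleaner/archival properties shown are standard Hudi write configs rather than settings recommended for this ticket.

      ```shell
      # Sketch: run the standalone HoodieCleaner against the affected table.
      # Paths, versions, and retention values below are illustrative only.
      spark-submit \
        --class org.apache.hudi.utilities.HoodieCleaner \
        hudi-utilities-bundle_2.12-0.11.1.jar \
        --target-base-path s3://my-bucket/path/to/hudi_table \
        --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
        --hoodie-conf hoodie.cleaner.commits.retained=10 \
        --hoodie-conf hoodie.keep.min.commits=20 \
        --hoodie-conf hoodie.keep.max.commits=30
      ```

      Note that cleaning and archival are distinct: the cleaner removes old file slices from data partitions, while timeline archival (governed by hoodie.keep.min.commits / hoodie.keep.max.commits) is what trims the instant files under .hoodie.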


          People

            Assignee: Unassigned
            Reporter: Haitham Eltaweel
            Votes: 0
            Watchers: 1
