Apache Hudi / HUDI-7867

Data duplication caused by a drawback in deleting invalid files before commit



    Description

      A user reported that after their daily job, which writes to a Hudi COW table, finished, the downstream reading jobs found many duplicate records. The daily job has been running in production for a long time, and this is the first time it produced such a wrong result.
      They provided a duplicated record as an example to help debug. The record appeared in 3 base files belonging to different file groups.

      I checked today's writer job; the Spark application finished successfully.
      In the driver log, two of those files are marked as invalid files to be deleted, and only one file is valid.

      In the clean-stage task log, those two files are also marked for deletion, and there is no exception in the task either.

      Those two files already existed on HDFS before the clean stage began, but they still existed after the clean stage finished.

      Finally, I found the root cause: a corner case happened in HDFS, and fs.delete does not throw any exception; it only returns false if HDFS does not delete the file successfully.
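
      For illustration, a minimal sketch (not the actual Hudi cleanup code) of the failure pattern: a cleanup loop that calls FileSystem.delete and drops the boolean it returns, so a file that HDFS silently failed to delete survives into the commit. The class and method names here are made up for the example.

      import java.io.IOException;
      import java.util.List;

      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class InvalidFileCleanupSketch {

        // Deletes the base files that were marked invalid before commit.
        // FileSystem#delete throws IOException on I/O errors, but in some
        // corner cases it simply returns false when the file was NOT deleted.
        // Ignoring that return value lets the file survive and later show up
        // as a duplicate base file in the COW table.
        static void deleteInvalidFiles(FileSystem fs, List<Path> invalidFiles) throws IOException {
          for (Path file : invalidFiles) {
            fs.delete(file, false); // return value silently ignored; this is the bug pattern
          }
        }
      }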

      I checked the fs.delete API; the definition itself is reasonable, but it means the caller must check the boolean return value instead of relying on an exception.
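
      A hedged sketch of the defensive direction this suggests: check the boolean result (and re-check existence) and fail fast instead of silently proceeding. HoodieIOException is the runtime exception Hudi typically wraps I/O failures in; the class and method names below are hypothetical.

      import java.io.IOException;
      import java.util.List;

      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hudi.exception.HoodieIOException;

      public class SafeInvalidFileCleanupSketch {

        // Deletes invalid files and fails fast if HDFS reports the delete did not happen.
        static void deleteInvalidFilesOrFail(FileSystem fs, List<Path> invalidFiles) {
          for (Path file : invalidFiles) {
            try {
              boolean deleted = fs.delete(file, false);
              // delete() may return false without throwing; if the file is still
              // there, abort so the commit cannot finish with duplicate base files.
              if (!deleted && fs.exists(file)) {
                throw new HoodieIOException("Failed to delete invalid file: " + file);
              }
            } catch (IOException e) {
              throw new HoodieIOException("Error while deleting invalid file: " + file, e);
            }
          }
        }
      }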

            People

              jingzhang Jing Zhang
              jingzhang Jing Zhang
              Danny Chen
              Votes: 0
              Watchers: 2
