[HUDI-4992] Spark Row-writing Bulk Insert produces incorrect Bloom Filter metadata - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.12.0
Fix Version/s: 0.12.1
Component/s: None
Labels:
- pull-request-available

Story Points: 2
Epic Link: Hudi Spark Datasource

Description

While troubleshooting a duplicates issue with Abhishek Modi from Notion, we found that the min/max record key stats are currently persisted incorrectly into the Parquet metadata, which leads to duplicate records being produced in their pipeline after the initial bulk-insert.
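
To illustrate why wrong min/max record key stats can surface as duplicates, here is a minimal sketch in Python. It is not Hudi code: `FileStats`, `tag_record`, and the sample keys are hypothetical names made up for illustration, and a plain set stands in for the Bloom filter. It assumes an index lookup that first prunes files by their recorded key range, so a key outside the recorded range is never checked against the file and the incoming record is treated as a new insert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for the per-file metadata kept in a Parquet footer."""
    file_id: str
    min_key: str          # recorded minimum record key
    max_key: str          # recorded maximum record key
    keys: frozenset       # stand-in for the Bloom filter / actual keys in the file


def tag_record(files, record_key):
    """Return 'update' if some file is found to hold the key, else 'insert'.

    Files are pruned by their recorded [min_key, max_key] range before the
    membership check, so the result depends on those stats being correct.
    """
    for f in files:
        in_range = f.min_key <= record_key <= f.max_key
        if in_range and record_key in f.keys:
            return "update"
    return "insert"


# The file really contains these three keys after the initial bulk-insert.
keys = frozenset({"key-001", "key-050", "key-100"})

# Stats that cover the true key range.
correct = [FileStats("f1", "key-001", "key-100", keys)]

# Assumed example of bad stats: the recorded range does not cover the true range.
wrong = [FileStats("f1", "key-001", "key-001", keys)]

print(tag_record(correct, "key-050"))  # existing record is found
print(tag_record(wrong, "key-050"))    # file is pruned, record looks new
```

With the correct stats the incoming `key-050` is tagged as an update; with the bad stats the file is skipped and the same key is tagged as an insert, which would write a second copy of the record.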

Issue Links

is a parent of
- HUDI-5051 Add a functional regression test for Bloom Index followed on w/ Upserts (Closed)

links to
- GitHub Pull Request #6883

People

Assignee: Alexey Kudinkin
Reporter: Alexey Kudinkin
Votes: 0
Watchers: 1

Dates

Created: 06/Oct/22 23:13
Updated: 18/Oct/22 20:13
Resolved: 07/Oct/22 15:42