Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
Description
Code to reproduce -
Github Issue - https://github.com/apache/hudi/issues/9967
```
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# TABLE_NAME and PATH are defined elsewhere by the reporter.
schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
    ]
)
data = [
    Row(1, "a"),
    Row(2, "a"),
    Row(3, "c"),
]
hudi_configs = {
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.recordkey.field": "name",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}
df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
spark.read.format("hudi").load(PATH).show()
# Shows no records
```
```
df.write.format("org.apache.hudi").options(**hudi_configs).option("hoodie.datasource.write.insert.drop.duplicates", "true").mode("append").save(PATH)
spark.read.format("hudi").load(PATH).show()
```
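The behavior the reproduction exercises can be sketched in plain Python. This is a simplified model of Hudi's default precombine semantics (not Hudi's actual implementation): among rows that share a record key, the row with the largest precombine field value is kept. With `name` as the record key and `id` as the precombine field, the three rows above should collapse to two, rather than disappearing entirely as the bug report describes. The function name `precombine` here is illustrative, not a Hudi API.

```python
def precombine(rows, record_key, precombine_field):
    """Keep one row per record key: the row with the max precombine value.

    Simplified sketch of Hudi's default dedup behavior, for illustration only.
    """
    best = {}
    for row in rows:
        key = row[record_key]
        if key not in best or row[precombine_field] > best[key][precombine_field]:
            best[key] = row
    return list(best.values())


# Mirrors the reproduction data above: ids 1 and 2 share record key "a".
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "a"},
    {"id": 3, "name": "c"},
]
print(precombine(rows, record_key="name", precombine_field="id"))
# → [{'id': 2, 'name': 'a'}, {'id': 3, 'name': 'c'}]
```

Under this model the expected result of the write with `hoodie.datasource.write.insert.drop.duplicates=true` is two rows, so an empty table indicates the dedup path is dropping all records instead of only the losing duplicates.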