Apache Hudi / HUDI-5257

Spark-Sql duplicates and re-uses record keys under certain configs and use cases


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: bootstrap, spark-sql
    • Labels: None

    Description

      On a new table with primary key _row_key and partitioned by partition_path, if you do a bulk insert by:

      insertDf.createOrReplaceTempView("insert_temp_table")
      spark.sql("set hoodie.datasource.write.operation=bulk_insert")
      spark.sql("set hoodie.sql.bulk.insert.enable=true")
      spark.sql("set hoodie.sql.insert.mode=non-strict")
      spark.sql(s"insert into $tableName select * from insert_temp_table")

      you will get data like the attached bad_data.txt, where multiple records share the same record key even though they have different primary key values, and there are multiple files even though there are only 10 records.

      Changing hoodie.datasource.write.operation=bulk_insert to hoodie.datasource.write.operation=insert causes the data to be inserted correctly, although I do not know whether bulk insert is still being used after this change.
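
      That is, keeping everything else above the same and only overriding the operation makes the repro pass:

      spark.sql("set hoodie.datasource.write.operation=insert")
      spark.sql(s"insert into $tableName select * from insert_temp_table")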

       

      However, if you use bulk insert with raw data like 

      spark.sql(s"""         
      | insert into $tableName values         
      | $values 
      |""".stripMargin
      )

      where $values is something like

      (1, 'a1', 10, 1000, "2021-01-05"), 

      then hoodie.datasource.write.operation=bulk_insert works as expected.
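
      A quick way to surface the duplicated keys is to group on Hudi's _hoodie_record_key metadata column (a sketch; any key returned here was assigned to more than one record):

      // Record keys shared by multiple rows despite distinct _row_key values
      spark.sql(
        s"""select _hoodie_record_key, count(*) as cnt
           |from $tableName
           |group by _hoodie_record_key
           |having count(*) > 1
           |""".stripMargin).show(false)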

      Attachments

        1. bad_data.txt (5 kB), uploaded by Jonathan Vexler


      People

        Assignee: jonvex (Jonathan Vexler)
        Reporter: jonvex (Jonathan Vexler)
        Votes: 0
        Watchers: 1
