Description
On a new table with primary key _row_key and partitioned by partition_path, if you do a bulk insert by:
insertDf.createOrReplaceTempView("insert_temp_table")
spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table")
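For context, a table definition roughly matching this description might look like the sketch below. Only _row_key (primary key) and partition_path (partition column) come from the report; the remaining column names, types, and the preCombineField are assumptions for illustration.

// hypothetical DDL; columns other than _row_key and partition_path are assumed
spark.sql(
  s"""
     |create table $tableName (
     |  _row_key string,
     |  name string,
     |  price double,
     |  ts long,
     |  partition_path string
     |) using hudi
     |partitioned by (partition_path)
     |tblproperties (
     |  primaryKey = '_row_key',
     |  preCombineField = 'ts'
     |)
     |""".stripMargin)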
you will get data like that in bad_data.txt, where multiple records have the same key even though they have different primary key values, and multiple files are written even though there are only 10 records.
Changing hoodie.datasource.write.operation=bulk_insert to hoodie.datasource.write.operation=insert causes the data to be inserted correctly, although I do not know whether a bulk insert is actually performed with this change.
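A minimal sketch of that working variant, assuming the same temp view as above; the only change from the failing snippet is the write operation:

// identical to the failing snippet except for the write operation
insertDf.createOrReplaceTempView("insert_temp_table")
spark.sql("set hoodie.datasource.write.operation=insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
spark.sql(s"insert into $tableName select * from insert_temp_table")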
However, if you use bulk insert with raw data like
spark.sql(s""" | insert into $tableName values | $values |""".stripMargin )
where $values is something like
(1, 'a1', 10, 1000, "2021-01-05"),
then hoodie.datasource.write.operation=bulk_insert works as expected.
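Put together, the literal-values form that behaves correctly looks roughly like this; the second row is added here only for illustration and is not from the report:

spark.sql(s"set hoodie.datasource.write.operation=bulk_insert")
spark.sql("set hoodie.sql.bulk.insert.enable=true")
spark.sql("set hoodie.sql.insert.mode=non-strict")
// literal values inlined instead of selecting from a temp view
spark.sql(
  s"""
     |insert into $tableName values
     |  (1, 'a1', 10, 1000, '2021-01-05'),
     |  (2, 'a2', 20, 2000, '2021-01-06')
     |""".stripMargin)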