Apache Hudi / HUDI-4765

Compared with spark-shell, inserting data via spark-sql uses different _hoodie_record_key generation logic, which might affect data upserts


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.11.1
    • Fix Version/s: 0.12.1
    • Component/s: spark, spark-sql
    • Labels: None
    • Environment: Spark 3.1.1, Hudi 0.11.1

    Description

      Create table using spark-sql:

      create table hudi_mor_tbl (
        id int,
        name string,
        price double,
        ts bigint
      ) using hudi
      tblproperties (
        type = 'mor',
        primaryKey = 'id',
        preCombineField = 'ts'
      )
      location 'hdfs:///hudi/hudi_mor_tbl'; 
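
      For reference, the key generator recorded for the table can be inspected in its .hoodie/hoodie.properties file. A minimal sketch in spark-shell, assuming this Hudi version persists a hoodie.table.keygenerator.class entry there:

      // Inspect the table's hoodie.properties for key-generator settings
      spark.read.textFile("hdfs:///hudi/hudi_mor_tbl/.hoodie/hoodie.properties").
        filter(_.contains("keygenerator")).
        show(false)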

      And then insert data via spark-shell and spark-sql respectively. First, via spark-shell:

      import org.apache.spark.sql._
      import org.apache.spark.sql.types._
      import org.apache.spark.sql.SaveMode._
      import org.apache.hudi.DataSourceWriteOptions._
      import org.apache.hudi.config.HoodieWriteConfig._

      val fields = Array(
            StructField("id", IntegerType, true),
            StructField("name", StringType, true),
            StructField("price", DoubleType, true),
            StructField("ts", LongType, true)
        )
      val simpleSchema = StructType(fields)
      val data = Seq(Row(2, "a2", 200.0, 100L))
      // createDataFrame needs an RDD[Row] (or java.util.List[Row]) plus a schema
      val df = spark.createDataFrame(spark.sparkContext.parallelize(data), simpleSchema)
      df.write.format("hudi").
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "id").
        option(TABLE_NAME, "hudi_mor_tbl").
        option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
        mode(Append).
        save("hdfs:///hudi/hudi_mor_tbl")
      Then via spark-sql:

      insert into hudi_mor_tbl select 1, 'a1', 20, 1000;

      After that, querying the table shows the two rows as below:

      +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
      |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price|  ts|
      +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
      |  20220902012710792|20220902012710792...|                 2|                      |c3eff8c8-fa47-48c...|  2|  a2|200.0| 100|
      |  20220902012813658|20220902012813658...|              id:1|                      |c3eff8c8-fa47-48c...|  1|  a1| 20.0|1000|
      +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+ 

      The '_hoodie_record_key' field for the row inserted via spark-sql is 'id:1', while for the row inserted via spark-shell it is '2'. It seems that spark-sql uses '[primaryKey_field_name]:[primaryKey_field_value]' to construct '_hoodie_record_key', which differs from spark-shell.
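
      For context, the 'field:value' form matches the output format of org.apache.hudi.keygen.ComplexKeyGenerator, while the bare value matches org.apache.hudi.keygen.SimpleKeyGenerator (the default on the datasource write path for a single record key field). As a minimal sketch, assuming the keygenerator class write option is honored here, pinning the generator explicitly on the spark-shell write (continuing the session above) should reproduce the spark-sql-style key:

      // Assumption: with ComplexKeyGenerator pinned, the same row gets
      // _hoodie_record_key "id:2" instead of "2"
      df.write.format("hudi").
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "id").
        option("hoodie.datasource.write.keygenerator.class",
          "org.apache.hudi.keygen.ComplexKeyGenerator").
        option(TABLE_NAME, "hudi_mor_tbl").
        option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
        mode(Append).
        save("hdfs:///hudi/hudi_mor_tbl")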

      As a result, if we insert one row via spark-sql and then upsert it via spark-shell, we get two duplicate rows instead of one updated row, which is not what we expect (see the sketch below).
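
      A minimal sketch of that failure mode, continuing the spark-shell session above and reusing its schema and write options:

      // Upsert id=1 from spark-shell: the key generated here is "1", which
      // does not match the stored "id:1", so Hudi writes a second copy
      val upsertData = Seq(Row(1, "a1", 20.0, 2000L))
      val upsertDf = spark.createDataFrame(
        spark.sparkContext.parallelize(upsertData), simpleSchema)
      upsertDf.write.format("hudi").
        option(OPERATION_OPT_KEY, "upsert").
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "id").
        option(TABLE_NAME, "hudi_mor_tbl").
        option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
        mode(Append).
        save("hdfs:///hudi/hudi_mor_tbl")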

      Did I miss some configuration that might lead to this issue? If not, I personally think we should make the default record key generation logic consistent.

            People

              Assignee: Raymond Xu (xushiyan)
              Reporter: Yao Zhang (paul8263)
              Votes: 0
              Watchers: 3
