Apache Hudi / HUDI-4765

Compared with spark-shell, inserting data via spark-sql uses different _hoodie_record_key generation logic, which might affect data upserts


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.11.1
    • Fix Version/s: 0.12.1
    • Component/s: spark, spark-sql
    • Labels: None
    • Environment: Spark 3.1.1, Hudi 0.11.1

    Description

      Create table using spark-sql:

      create table hudi_mor_tbl (
        id int,
        name string,
        price double,
        ts bigint
      ) using hudi
      tblproperties (
        type = 'mor',
        primaryKey = 'id',
        preCombineField = 'ts'
      )
      location 'hdfs:///hudi/hudi_mor_tbl'; 
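
      For reference, the key generator recorded for the table can be inspected in its .hoodie/hoodie.properties file. A minimal sketch in spark-shell, assuming this Hudi version persists a hoodie.table.keygenerator.class entry there:

      // Inspect the table's hoodie.properties for key-generator settings
      spark.read.textFile("hdfs:///hudi/hudi_mor_tbl/.hoodie/hoodie.properties").
        filter(_.contains("keygenerator")).
        show(false)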

      And then insert data via spark-shell and spark-sql respectively. First, via spark-shell:

      import org.apache.spark.sql._
      import org.apache.spark.sql.types._
      import org.apache.spark.sql.SaveMode._
      import org.apache.hudi.DataSourceWriteOptions._
      import org.apache.hudi.config.HoodieWriteConfig._

      val fields = Array(
            StructField("id", IntegerType, true),
            StructField("name", StringType, true),
            StructField("price", DoubleType, true),
            StructField("ts", LongType, true)
        )
      val simpleSchema = StructType(fields)
      val data = Seq(Row(2, "a2", 200.0, 100L))
      // createDataFrame needs an RDD[Row] (or java.util.List[Row]) plus a schema
      val df = spark.createDataFrame(spark.sparkContext.parallelize(data), simpleSchema)
      df.write.format("hudi").
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "id").
        option(TABLE_NAME, "hudi_mor_tbl").
        option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
        mode(Append).
        save("hdfs:///hudi/hudi_mor_tbl")
      Then via spark-sql:

      insert into hudi_mor_tbl select 1, 'a1', 20, 1000;

      After that, querying the table shows the two rows as below:

      +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
      |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price|  ts|
      +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
      |  20220902012710792|20220902012710792...|                 2|                      |c3eff8c8-fa47-48c...|  2|  a2|200.0| 100|
      |  20220902012813658|20220902012813658...|              id:1|                      |c3eff8c8-fa47-48c...|  1|  a1| 20.0|1000|
      +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+ 

      The '_hoodie_record_key' field for the row inserted via spark-sql is 'id:1', while for the row inserted via spark-shell it is '2'. It seems that spark-sql uses '[primaryKey_field_name]:[primaryKey_field_value]' to construct '_hoodie_record_key', which differs from spark-shell.
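
      For context, the 'field:value' form matches the output format of org.apache.hudi.keygen.ComplexKeyGenerator, while the bare value matches org.apache.hudi.keygen.SimpleKeyGenerator (the default on the datasource write path for a single record key field). As a minimal sketch, assuming the keygenerator class write option is honored here, pinning the generator explicitly on the spark-shell write (continuing the session above) should reproduce the spark-sql-style key:

      // Assumption: with ComplexKeyGenerator pinned, the same row gets
      // _hoodie_record_key "id:2" instead of "2"
      df.write.format("hudi").
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "id").
        option("hoodie.datasource.write.keygenerator.class",
          "org.apache.hudi.keygen.ComplexKeyGenerator").
        option(TABLE_NAME, "hudi_mor_tbl").
        option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
        mode(Append).
        save("hdfs:///hudi/hudi_mor_tbl")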

      As a result, if we insert one row via spark-sql and then upsert it via spark-shell, we get two duplicate rows instead of one updated row, which is not what we expect (see the sketch below).
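
      A minimal sketch of that failure mode, continuing the spark-shell session above and reusing its schema and write options:

      // Upsert id=1 from spark-shell: the key generated here is "1", which
      // does not match the stored "id:1", so Hudi writes a second copy
      val upsertData = Seq(Row(1, "a1", 20.0, 2000L))
      val upsertDf = spark.createDataFrame(
        spark.sparkContext.parallelize(upsertData), simpleSchema)
      upsertDf.write.format("hudi").
        option(OPERATION_OPT_KEY, "upsert").
        option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        option(RECORDKEY_FIELD_OPT_KEY, "id").
        option(TABLE_NAME, "hudi_mor_tbl").
        option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
        mode(Append).
        save("hdfs:///hudi/hudi_mor_tbl")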

      Did I miss some configuration that might lead to this issue? If not, I personally think we should make the default record key generation logic consistent.

            People

              Assignee: Raymond Xu (xushiyan)
              Reporter: Yao Zhang (paul8263)
              Votes: 0
              Watchers: 3
