Apache Hudi / HUDI-5257

Spark-Sql duplicates and re-uses record keys under certain configs and use cases


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: bootstrap, spark-sql
    • Labels: None

    Description

      On a new table with primary key _row_key and partitioned by partition_path, if you do a bulk insert by:

      insertDf.createOrReplaceTempView("insert_temp_table")
      spark.sql("set hoodie.datasource.write.operation=bulk_insert")
      spark.sql("set hoodie.sql.bulk.insert.enable=true")
      spark.sql("set hoodie.sql.insert.mode=non-strict")
      spark.sql(s"insert into $tableName select * from insert_temp_table")

      you will get data like the attached bad_data.txt, where multiple records share the same record key even though they have different primary key values, and there are multiple files even though there are only 10 records.

      Changing hoodie.datasource.write.operation=bulk_insert to hoodie.datasource.write.operation=insert causes the data to be inserted correctly, although I do not know whether bulk insert is still being used after this change.
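
      That is, keeping everything else above the same and only overriding the operation makes the repro pass:

      spark.sql("set hoodie.datasource.write.operation=insert")
      spark.sql(s"insert into $tableName select * from insert_temp_table")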

       

      However, if you use bulk insert with raw data like 

      spark.sql(s"""         
      | insert into $tableName values         
      | $values 
      |""".stripMargin
      )

      where $values is something like

      (1, 'a1', 10, 1000, "2021-01-05"), 

      then hoodie.datasource.write.operation=bulk_insert works as expected.
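
      A quick way to surface the duplicated keys is to group on Hudi's _hoodie_record_key metadata column (a sketch; any key returned here was assigned to more than one record):

      // Record keys shared by multiple rows despite distinct _row_key values
      spark.sql(
        s"""select _hoodie_record_key, count(*) as cnt
           |from $tableName
           |group by _hoodie_record_key
           |having count(*) > 1
           |""".stripMargin).show(false)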

      Attachments

        1. bad_data.txt (5 kB), uploaded by Jonathan Vexler


      People

        Assignee: jonvex (Jonathan Vexler)
        Reporter: jonvex (Jonathan Vexler)
        Votes: 0
        Watchers: 1
