Apache Hudi / HUDI-7100

Data loss when using insert_overwrite_table with insert.drop.duplicates


Details

    Description

Code to reproduce:

GitHub issue: https://github.com/apache/hudi/issues/9967

```
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Assumes an active SparkSession `spark` with the Hudi bundle on the classpath,
# and TABLE_NAME / PATH defined.
schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
    ]
)

data = [
    Row(1, "a"),
    Row(2, "a"),
    Row(3, "c"),
]

hudi_configs = {
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.recordkey.field": "name",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

# First write: insert_overwrite_table without insert.drop.duplicates.
df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)

spark.read.format("hudi").load(PATH).show()
# Shows the inserted records as expected.

# Second write: the same operation with insert.drop.duplicates enabled.
df.write.format("org.apache.hudi").options(**hudi_configs).option(
    "hoodie.datasource.write.insert.drop.duplicates", "true"
).mode("append").save(PATH)

spark.read.format("hudi").load(PATH).show()
# -- Showing no records: the table is overwritten with an empty commit (data loss).
```
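As a possible interim workaround (a sketch on my part, not a fix confirmed in this issue), deduplication can be done in Spark before the write instead of relying on hoodie.datasource.write.insert.drop.duplicates, keeping the row with the highest precombine value ("id") per record key ("name"):

```
from pyspark.sql import Window, functions as F

# Hypothetical workaround: dedupe on the record key in Spark, then run the
# plain insert_overwrite_table write without insert.drop.duplicates.
# Reuses df, hudi_configs, and PATH from the reproduction above.
w = Window.partitionBy("name").orderBy(F.col("id").desc())
deduped = (
    df.withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

deduped.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
```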


People

Assignee: sivabalan narayanan (shivnarayan)
Reporter: Aditya Goenka (adityagoenka)