  1. Apache Hudi
  2. HUDI-8721 Bridging Hudi Spark SQL behavior gaps - Phase 0
  3. HUDI-8628

MERGE INTO pulls in additional source fields that are not set in the UPDATE clause

Details

    • Type: Sub-task
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.0.1
    • Component/s: spark-sql
    • Labels: None

    Description

      spark.sql(s"set ${HoodieWriteConfig.MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
      spark.sql(s"set ${DataSourceWriteOptions.ENABLE_MERGE_INTO_PARTIAL_UPDATES.key} = true")
      spark.sql(s"set ${HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key} = $logDataBlockFormat")
      spark.sql(s"set ${HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key} = false")

      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  name string,
           |  price long,
           |  _ts long,
           |  description string
           |) using hudi
           |tblproperties(
           |  type = '$tableType',
           |  primaryKey = 'id',
           |  preCombineField = '_ts'
           |)
           |location '$basePath'
           |""".stripMargin)
      spark.sql(s"insert into $tableName values (1, 'a1', 10, 1000, 'a1: desc1')," +
        "(2, 'a2', 20, 1200, 'a2: desc2'), (3, 'a3', 30.0, 1250, 'a3: desc3')")

       

       

      Merge Into:

      // Partial updates using MERGE INTO statement with changed fields: "price" and "_ts"
      spark.sql(
        s"""
           |merge into $tableName t0
           |using ( select 1 as id, 'a1' as name, 12 as price, 1001 as _ts
           |union select 3 as id, 'a3' as name, 25 as price, 1260 as _ts) s0
           |on t0.id = s0.id
           |when matched then update set price = s0.price, _ts = s0._ts
           |""".stripMargin)
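For reference, the result this statement is expected to produce can be modeled in plain Scala. This is only a sketch of the intended semantics, not Hudi code: the `Row`, `SourceRow`, and `mergePartial` names are hypothetical, `_ts` is written as `ts`, and ordering by the preCombine field is ignored because both source `_ts` values are newer than the stored ones.

```scala
// Plain-Scala model of the expected partial-update semantics; not Hudi code.
// All names here are hypothetical and exist only for illustration.
final case class Row(id: Int, name: String, price: Long, ts: Long, description: String)
final case class SourceRow(id: Int, name: String, price: Long, ts: Long)

object PartialUpdateModel {
  // Rows written by the insert statement in this report.
  val target: Seq[Row] = Seq(
    Row(1, "a1", 10L, 1000L, "a1: desc1"),
    Row(2, "a2", 20L, 1200L, "a2: desc2"),
    Row(3, "a3", 30L, 1250L, "a3: desc3"))

  // Rows produced by the source subquery s0.
  val source: Seq[SourceRow] = Seq(
    SourceRow(1, "a1", 12L, 1001L),
    SourceRow(3, "a3", 25L, 1260L))

  // Models "when matched then update set price = s0.price, _ts = s0._ts":
  // only price and ts are copied; name and description stay as stored.
  def mergePartial(target: Seq[Row], source: Seq[SourceRow]): Seq[Row] = {
    val byId = source.map(s => s.id -> s).toMap
    target.map { t =>
      byId.get(t.id) match {
        case Some(s) => t.copy(price = s.price, ts = s.ts)
        case None    => t
      }
    }
  }

  def main(args: Array[String]): Unit =
    mergePartial(target, source).foreach(println)
}
```

In this model the source column `name` is never read, which is the behavior the report expects from the writer as well.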

       

      The schema for this MERGE INTO command, as seen when execution reaches HoodieSparkSqlWriter.deduceWriterSchema, is shown in the attached screenshot (image-2024-12-02-03-58-48-178.png), i.e. at the call:

      val writerSchema = HoodieSchemaUtils.deduceWriterSchema(sourceSchema, latestTableSchemaOpt, internalSchemaOpt, parameters)


      The MERGE INTO command only instructs Hudi to update price and _ts, so why are other fields (for example, name) also picked up from the source?
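To make the question concrete, here is a small plain-Scala sketch that compares the columns the source exposes with the columns the update clause actually assigns. It does not call any Hudi API. The `ExpectedWriterColumns` object and its members are hypothetical, and it assumes that the record key (`id`) must accompany the assigned columns so that the update can be matched to a record; the issue itself does not state the exact schema it expects.

```scala
// Plain-Scala sketch; not Hudi code. All names are hypothetical.
object ExpectedWriterColumns {
  // Table columns, in the order declared by the create table statement.
  val tableColumns: Seq[String] = Seq("id", "name", "price", "_ts", "description")

  // Columns exposed by the source subquery s0.
  val sourceColumns: Seq[String] = Seq("id", "name", "price", "_ts")

  // Columns assigned in the "update set" clause.
  val assignedColumns: Set[String] = Set("price", "_ts")

  // Assumption: the record key travels with every partial update.
  // The preCombine field (_ts) is already among the assigned columns here.
  val keyColumns: Set[String] = Set("id")

  // Columns a partial update would need under the assumption above,
  // kept in table order.
  def expectedPartialColumns: Seq[String] =
    tableColumns.filter(c => assignedColumns(c) || keyColumns(c))

  // Source columns that are neither assigned nor key columns.
  def unexpectedSourceColumns: Seq[String] =
    sourceColumns.filterNot(expectedPartialColumns.contains)

  def main(args: Array[String]): Unit = {
    println(s"expected:   ${expectedPartialColumns.mkString(", ")}")
    println(s"unexpected: ${unexpectedSourceColumns.mkString(", ")}")
  }
}
```

Under that assumption the only surplus source column is `name`, matching the field called out in the question.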

      To reproduce, see the test "Test partial update with MOR and Avro log format" in TestPartialUpdateForMergeInto.

       

      Note: this concerns partial update support with MERGE INTO, not a regular MERGE INTO.

       

       

      Attachments

        1. image-2024-12-02-03-58-48-178.png
          97 kB
          sivabalan narayanan

          People

            Assignee: Unassigned
            Reporter: sivabalan narayanan (shivnarayan)