Uploaded image for project: 'Apache Hudi'
  1. Apache Hudi
  2. HUDI-298

Upsert MOR table but got a NULL value

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: None

      Description

      I ran into this issue when I tried to update a record in a MOR table.

      I found that this issue depends on the order of columns. If the partition key is the last column, then it will work normally. But if the partition key is not the last column, then it will return NULL for the updated field.

      Here is a code example:

      import org.apache.hudi.DataSourceWriteOptions
      import org.apache.hudi.config.HoodieWriteConfig
      import org.apache.spark.sql.SaveModeval 
      
      df = Seq(
        ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
        ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
        ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
        ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
        ).toDF("event_id", "event_name", "event_ts", "event_type")
      
      var tableName = "hudi_events_mor_1"
      var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
      
      // write hudi dataset
      df.write.format("org.apache.hudi")
        .option(HoodieWriteConfig.TABLE_NAME, tableName)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
        .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
        .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
        .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
        .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
        .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
        .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
        .mode(SaveMode.Overwrite)
        .save(tablePath)
      
      // update a record with event_name "event_name_123" => "event_name_changed"
      val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
      val df2 = df1.filter($"event_id" === "104")
      val df3 = df2.withColumn("event_name", lit("event_name_changed"))
      
      // update hudi dataset
      df3.write.format("org.apache.hudi")
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
         .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
         .option("hoodie.compact.inline", "false")
         .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
         .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
         .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
         .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
         .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
         .mode(SaveMode.Append)
         .save(tablePath)
      

       

      Then I looked into the hive table, the event name is updated in the _rt table as expected:

      hive> select * from hudi_events_mor_1 where event_id="104";
      OK
      20191008184508	20191008184508_1_4	104	type1	ec3761e6-22eb-47f2-b87d-d47484ff8d5f-0_1-4-8_20191008184508.parquet	104	event_name_123	2015-01-01T12:15:00.512679Z	type1
      
      hive> select * from hudi_events_mor_1_rt where event_id="104";
      OK
      20191008184558	20191008184558_0_1	104	type1	ec3761e6-22eb-47f2-b87d-d47484ff8d5f-0	104	event_name_changed	2015-01-01T12:15:00.512679Z	type1
      

       

      But if I do the exact same thing for another dataframe with different column order:

      val df = Seq(
        ("100", "type1", "event_name_900", "2015-01-01T13:51:39.340396Z"),
        ("101", "type2", "event_name_546", "2015-01-01T12:14:58.597216Z"),
        ("104", "type1", "event_name_123", "2015-01-01T12:15:00.512679Z"),
        ("105", "type2", "event_name_678", "2015-01-01T13:51:42.248818Z")
        ).toDF("event_id", "event_type", "event_name", "event_ts")

      And I also selected event_type as my partition key (this time the partition key column is not the last column)

      Then I looked into the hive table:

      hive> select * from hudi_events_mor_2 where event_id="104";
      OK
      20191008185116	20191008185116_1_2	104	type1	23e3c091-f414-4fce-a51f-70cc012d7af2-0_1-4-13_20191008185116.parquet	104	event_name_123	2015-01-01T12:15:00.512679Z	type1
      
      hive> select * from hudi_events_mor_2_rt where event_id="104";
      OK
      20191008185240	20191008185240_0_1	104	type1	23e3c091-f414-4fce-a51f-70cc012d7af2-0	104	NULL	event_name_changed	type1
      

      You can see the NULL value in the _rt table and also the order of column is messed up.

       

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                uditme Udit Mehrotra
                Reporter:
                wenningd Wenning Ding
              • Votes:
                1 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m