
Details

    Description

      Test Case:

       import org.apache.hudi.QuickstartUtils._
       import scala.collection.JavaConversions._
       import org.apache.spark.sql.SaveMode._
       import org.apache.hudi.DataSourceReadOptions._
       import org.apache.hudi.DataSourceWriteOptions._
       import org.apache.hudi.config.HoodieWriteConfig._

      1. Prepare the data

      spark.sql("create table test1(a int,b string,c string) using hudi partitioned by(b) options(primaryKey='a')")
      spark.sql("insert into table test1 select 1,2,3")

      2. Create Hudi table test2

      spark.sql("create table test2(a int,b string,c string) using hudi partitioned by(b) options(primaryKey='a')")

      3. Write data to test2 via the datasource API

      val base_data = spark.sql("select * from testdb.test1")
      base_data.write.format("hudi").
        option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).
        option(RECORDKEY_FIELD_OPT_KEY, "a").
        option(PARTITIONPATH_FIELD_OPT_KEY, "b").
        option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").
        option(OPERATION_OPT_KEY, "bulk_insert").
        option(HIVE_SYNC_ENABLED_OPT_KEY, "true").
        option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").
        option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").
        option(HIVE_DATABASE_OPT_KEY, "testdb").
        option(HIVE_TABLE_OPT_KEY, "test2").
        option(HIVE_USE_JDBC_OPT_KEY, "true").
        option("hoodie.bulkinsert.shuffle.parallelism", 4).
        option("hoodie.datasource.write.hive_style_partitioning", "true").
        option(TABLE_NAME, "test2").
        mode(Append).
        save("/user/hive/warehouse/testdb.db/test2")

       

      At this point, querying the table returns:

      +---+---+---+
      | a| b| c|
      +---+---+---+
      | 1| 3| 2|
      +---+---+---+

      4. Delete one record

      spark.sql("delete from testdb.test2 where a=1")

      5. Query again: the record with a=1 has not been deleted

      spark.sql("select a,b,c from testdb.test2").show
      +---+---+---+
      | a| b| c|
      +---+---+---+
      | 1| 3| 2|
      +---+---+---+
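      For comparison, the same delete can be issued through the Hudi datasource writer instead of Spark SQL, reusing the key and partition configuration from the bulk_insert above. This is a sketch only (it assumes the same Spark session and imports as the repro steps; its behavior on this table has not been verified here):

      // Sketch: delete via the datasource writer rather than SQL.
      // Assumes the same session and DataSourceWriteOptions imports as above.
      val toDelete = spark.sql("select * from testdb.test2 where a=1")
      toDelete.write.format("hudi").
        option(RECORDKEY_FIELD_OPT_KEY, "a").
        option(PARTITIONPATH_FIELD_OPT_KEY, "b").
        option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").
        option(OPERATION_OPT_KEY, "delete").   // issue a delete instead of an upsert
        option(TABLE_NAME, "test2").
        mode(Append).
        save("/user/hive/warehouse/testdb.db/test2")

      If this path removes the record while `delete from` does not, the problem is likely specific to the SQL DML path rather than to the table's key configuration.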

       

      People

        Assignee: Yann Byron (biyan900116@gmail.com)
        Reporter: renhao
