[SPARK-41982] When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1` - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: Spark Core
Labels:
None

Description

At present, during the process of upgrading Spark2.4 to Spark3.2, we carefully read the migration documentwe and found a kind of situation not involved:

create table if not exists test_90(a string, b string) partitioned by (dt string);
desc formatted test_90;
// case1
insert into table test_90 partition (dt=05) values("1","2");
// case2
insert into table test_90 partition (dt='05') values("1","2");
drop table test_90;

in spark2.4.3, it will generate such a path:

// the path
hdfs://test5/user/hive/db1/test_90/dt=05 

//result
spark-sql> select * from test_90;
1       2       05
1       2       05
Time taken: 1.316 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90; 
dt=05 
Time taken: 0.201 seconds, Fetched 1 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
1       2       05
Time taken: 0.212 seconds, Fetched 2 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
== Physical Plan ==
Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, [a, b]
+- LocalTableScan [a#116, b#117]
Time taken: 1.145 seconds, Fetched 1 row(s)

in spark3.2.0, it will generate two path:

// the path
hdfs://test5/user/hive/db1/test_90/dt=05 
hdfs://test5/user/hive/db1/test_90/dt=5 

// result
spark-sql> select * from test_90;
1       2       05
1       2       5
Time taken: 2.119 seconds, Fetched 2 row(s)

spark-sql> show partitions test_90;
dt=05
dt=5
Time taken: 0.161 seconds, Fetched 2 row(s)

spark-sql> select * from test_90 where dt='05';
1       2       05
Time taken: 0.252 seconds, Fetched 1 row(s)

spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
plan
== Physical Plan ==
Execute InsertIntoHiveTable `db1`.`test_90`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
+- LocalTableScan [a#109, b#110]

This will cause problems in reading data after the user switches to spark3. The root cause is that in the process of partition field resolution, Spark3 has a process of strongly converting this string type, which will cause partition `05` to lose the previous `0`

So I think we have two solutions:

one is to record the risk clearly in the migration document, and the other is to repair this case, because we internally keep the partition of string type as string type, regardless of whether single or double quotation marks are added.

Attachments

Issue Links

links to

[Github] Pull Request #39558 (smallzhongfeng)

Activity

People

Assignee:: jingxiong zhong

Reporter:: jingxiong zhong

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 11/Jan/23 09:12

Updated:: 17/Jan/23 04:47

Resolved:: 17/Jan/23 04:47