Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
3.4.0
-
None
Description
At present, during the process of upgrading Spark2.4 to Spark3.2, we carefully read the migration documentwe and found a kind of situation not involved:
create table if not exists test_90(a string, b string) partitioned by (dt string); desc formatted test_90; // case1 insert into table test_90 partition (dt=05) values("1","2"); // case2 insert into table test_90 partition (dt='05') values("1","2"); drop table test_90;
in spark2.4.3, it will generate such a path:
// the path hdfs://test5/user/hive/db1/test_90/dt=05 //result spark-sql> select * from test_90; 1 2 05 1 2 05 Time taken: 1.316 seconds, Fetched 2 row(s) spark-sql> show partitions test_90; dt=05 Time taken: 0.201 seconds, Fetched 1 row(s) spark-sql> select * from test_90 where dt='05'; 1 2 05 1 2 05 Time taken: 0.212 seconds, Fetched 2 row(s) spark-sql> explain insert into table test_90 partition (dt=05) values("1","2"); == Physical Plan == Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, [a, b] +- LocalTableScan [a#116, b#117] Time taken: 1.145 seconds, Fetched 1 row(s)
in spark3.2.0, it will generate two path:
// the path hdfs://test5/user/hive/db1/test_90/dt=05 hdfs://test5/user/hive/db1/test_90/dt=5 // result spark-sql> select * from test_90; 1 2 05 1 2 5 Time taken: 2.119 seconds, Fetched 2 row(s) spark-sql> show partitions test_90; dt=05 dt=5 Time taken: 0.161 seconds, Fetched 2 row(s) spark-sql> select * from test_90 where dt='05'; 1 2 05 Time taken: 0.252 seconds, Fetched 1 row(s) spark-sql> explain insert into table test_90 partition (dt=05) values("1","2"); plan == Physical Plan == Execute InsertIntoHiveTable `db1`.`test_90`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b] +- LocalTableScan [a#109, b#110]
This will cause problems in reading data after the user switches to spark3. The root cause is that in the process of partition field resolution, Spark3 has a process of strongly converting this string type, which will cause partition `05` to lose the previous `0`
So I think we have two solutions:
one is to record the risk clearly in the migration document, and the other is to repair this case, because we internally keep the partition of string type as string type, regardless of whether single or double quotation marks are added.