Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-41982

When the inserted partition type is of string type, similar `dt=01` will be converted to `dt=1`

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 3.4.0
    • 3.4.0
    • Spark Core
    • None

    Description

      At present, during the process of upgrading Spark2.4 to Spark3.2, we carefully read the migration documentwe and found a kind of situation not involved:

      create table if not exists test_90(a string, b string) partitioned by (dt string);
      desc formatted test_90;
      // case1
      insert into table test_90 partition (dt=05) values("1","2");
      // case2
      insert into table test_90 partition (dt='05') values("1","2");
      drop table test_90;

      in spark2.4.3, it will generate such a path:

      // the path
      hdfs://test5/user/hive/db1/test_90/dt=05 
      
      //result
      spark-sql> select * from test_90;
      1       2       05
      1       2       05
      Time taken: 1.316 seconds, Fetched 2 row(s)
      
      spark-sql> show partitions test_90; 
      dt=05 
      Time taken: 0.201 seconds, Fetched 1 row(s)
      
      spark-sql> select * from test_90 where dt='05';
      1       2       05
      1       2       05
      Time taken: 0.212 seconds, Fetched 2 row(s)
      
      spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
      == Physical Plan ==
      Execute InsertIntoHiveTable InsertIntoHiveTable `db1`.`test_90`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, Map(dt -> Some(05)), false, false, [a, b]
      +- LocalTableScan [a#116, b#117]
      Time taken: 1.145 seconds, Fetched 1 row(s)

      in spark3.2.0, it will generate two path:

      // the path
      hdfs://test5/user/hive/db1/test_90/dt=05 
      hdfs://test5/user/hive/db1/test_90/dt=5 
      
      // result
      spark-sql> select * from test_90;
      1       2       05
      1       2       5
      Time taken: 2.119 seconds, Fetched 2 row(s)
      
      spark-sql> show partitions test_90;
      dt=05
      dt=5
      Time taken: 0.161 seconds, Fetched 2 row(s)
      
      spark-sql> select * from test_90 where dt='05';
      1       2       05
      Time taken: 0.252 seconds, Fetched 1 row(s)
      
      spark-sql> explain insert into table test_90 partition (dt=05) values("1","2");
      plan
      == Physical Plan ==
      Execute InsertIntoHiveTable `db1`.`test_90`, org.apache.hadoop.hive.ql.io.orc.OrcSerde, [dt=Some(5)], false, false, [a, b]
      +- LocalTableScan [a#109, b#110]

      This will cause problems in reading data after the user switches to spark3. The root cause is that in the process of partition field resolution, Spark3 has a process of strongly converting this string type, which will cause partition `05` to lose the previous `0`

      So I think we have two solutions:

      one is to record the risk clearly in the migration document, and the other is to repair this case, because we internally keep the partition of string type as string type, regardless of whether single or double quotation marks are added.

       

       

      Attachments

        Activity

          People

            zhongjingxiong jingxiong zhong
            zhongjingxiong jingxiong zhong
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: