Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25919

Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0, 2.2.1
    • Fix Version/s: None
    • Labels:
      None

      Description

      Hi

      I found a really strange issue. Below are the steps to reproduce it. This issue occurs only when the table row format is ParquetHiveSerDe and the target table is Partitioned

      Hive:

      Login in to hive terminal on cluster and create below tables.

      create table t_src(
      name varchar(10),
      dob timestamp
      )
      ROW FORMAT SERDE 
       'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      STORED AS INPUTFORMAT 
       'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
      OUTPUTFORMAT 
       'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      create table t_tgt(
      name varchar(10),
      dob timestamp
      )
      PARTITIONED BY (city varchar(10))
      ROW FORMAT SERDE 
       'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      STORED AS INPUTFORMAT 
       'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
      OUTPUTFORMAT 
       'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
      
      
      

      Insert data into the source table (t_src)

      INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');

      Spark-shell:

      Get on to spark-shell. 

      Execute below commands on spark shell:

      import org.apache.spark.sql.hive.HiveContext
      val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
      val q0 = "TRUNCATE table t_tgt"
      val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias"
      val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
      sqlContext.sql(q0)
      sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
      sqlContext.sql(q2)
      
      
      

       After this check the contents of target table t_tgt. You will see the date "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". Below snippets shows the contents of both the tables:

      select * from t_src;
      +-------------+------------------------+--+
      | t_src.name | t_src.dob |
      +-------------+------------------------+--+
      | p1 | 0001-01-01 00:00:00.0 |
      | p2 | 0002-01-01 00:00:00.0 |
      | p3 | 0003-01-01 00:00:00.0 |
      | p4 | 0004-01-01 00:00:00.0 |
      +-------------+------------------------+–+
       select * from t_tgt;
      +-------------+------------------------+--+
      | t_src.name | t_src.dob | t_tgt.city |
      +-------------+------------------------+--+
      | p1 | 0002-01-01 00:00:00.0 |__HIVE_DEF |
      | p2 | 0002-01-01 00:00:00.0 |__HIVE_DEF |
      | p3 | 0003-01-01 00:00:00.0 |__HIVE_DEF |
      | p4 | 0004-01-01 00:00:00.0 |__HIVE_DEF |
      +-------------+------------------------+--+
      

       

      Is this a known issue? Is it fixed in any subsequent releases?

      Thanks & regards,

      Pawan Lawale

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                pawanlawale Pawan
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: