  Spark / SPARK-25873

Date corruption when Spark and Hive both are on different timezones


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: None

    Description

      Timestamp values are altered when data is copied from one Hive table to another through Spark. This happens when Hive runs on a remote machine whose timezone differs from the one Spark is running in, and only when the source table's SerDe is 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'.
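      The corruption looks consistent with the same underlying instant being rendered under two different zone rules. As a standalone illustration (plain java.time, no Spark or Hive involved; the two zone IDs stand in for the EST and PDT machines in the steps below):

      ```scala
      import java.time.{Instant, ZoneId}
      import java.time.format.DateTimeFormatter

      // One fixed UTC instant, rendered as a wall-clock timestamp in two zones.
      // A value written under one zone and re-interpreted under the other
      // shifts by the offset difference, which is the symptom seen here.
      val instant = Instant.parse("2000-01-01T00:00:00Z")
      val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

      val inEst = instant.atZone(ZoneId.of("America/New_York")).format(fmt)
      val inPdt = instant.atZone(ZoneId.of("America/Los_Angeles")).format(fmt)

      println(inEst) // 1999-12-31 19:00:00
      println(inPdt) // 1999-12-31 16:00:00
      ```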

      Below are the steps to reproduce the issue:

      1. Create the two tables below in Hive, on a machine whose timezone is, say, EST:

       CREATE TABLE t_src(
       name varchar(10),
       dob timestamp
       )
       ROW FORMAT SERDE 
       'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
       STORED AS INPUTFORMAT 
       'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
       OUTPUTFORMAT 
       'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
      
      INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');
      
       CREATE TABLE t_tgt(
       name varchar(10),
       dob timestamp
       );
      

      2. Copy hive-site.xml into the spark-2.2.1-bin-hadoop2.7/conf folder, so that the Hive-aware sqlContext you create connects to your remote Hive server.
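      For reference, a minimal hive-site.xml sketch pointing Spark at a remote metastore (the host and port are placeholders, not values from my setup):

      ```xml
      <configuration>
        <!-- Placeholder: point this at your remote Hive metastore -->
        <property>
          <name>hive.metastore.uris</name>
          <value>thrift://hive-host.example.com:9083</value>
        </property>
      </configuration>
      ```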

      3. Start spark-shell on some other machine whose timezone differs from that of Hive, say PDT.

      4. Execute the code below:

      import org.apache.spark.sql.hive.HiveContext

      // sc is the SparkContext provided by spark-shell
      val sqlContext = new HiveContext(sc)

      val q0 = "TRUNCATE TABLE t_tgt"
      val q1 = "SELECT CAST(alias.name AS STRING) AS a0, alias.dob AS a1 FROM t_src alias"
      val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 AS c0, tbl0.a1 AS c1 FROM tbl0"

      sqlContext.sql(q0)
      sqlContext.sql(q1).select("a0", "a1").createOrReplaceTempView("tbl0")
      sqlContext.sql(q2)
      

      5. Back in Hive, check the contents of the target table (t_tgt). The dob field will have incorrect (shifted) values.

       

      Is this a known issue? Is there any workaround for it? Can it be fixed?
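      For what it's worth, one mitigation I have seen suggested (untested here; it is only a sketch, and assumes the Hive server runs in EST) is to pin the Spark JVM's default timezone to the Hive server's zone when launching the shell:

      ```shell
      # Untested sketch: force driver and executors to the Hive server's zone
      # ("America/New_York" is illustrative) so both sides interpret
      # timestamps under the same rules.
      spark-shell \
        --driver-java-options "-Duser.timezone=America/New_York" \
        --conf "spark.executor.extraJavaOptions=-Duser.timezone=America/New_York"
      ```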

       

      Thanks & regards,

      Pawan Lawale
