Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-34675

TimeZone inconsistencies when JVM and session timezones are different

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Reopened
    • Major
    • Resolution: Unresolved
    • 2.4.7
    • None
    • Spark Core
    • None

    Description

      Inserted following data with UTC as both JVM and session timezone.

      Spark-shell launch command

      bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions=' -Duser.timezone=UTC' --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
      

      Table creation

      sql("use ts").show
      sql("create table spark_parquet(type string, t timestamp) stored as parquet").show
      sql("create table spark_orc(type string, t timestamp) stored as orc").show
      sql("create table spark_avro(type string, t timestamp) stored as avro").show
      sql("create table spark_text(type string, t timestamp) stored as textfile").show
      sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show
      sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show
      sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show
      sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show
      

      Used following function to check and verify the returned timestamps

      scala> :paste
      // Entering paste mode (ctrl-D to finish)
      
      def showTs(
          db: String,
          tables: String*
      ): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
        sql("use " + db).show
        import scala.collection.mutable.ListBuffer
        var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
        for (tbl <- tables) {
          val query = "select * from " + tbl
          println("Executing - " + query);
          results += sql(query)
        }
        println("user.timezone - " + System.getProperty("user.timezone"))
        println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
        println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
        var unionDf = results(0)
        for (i <- 1 until results.length) {
          unionDf = unionDf.unionAll(results(i))
        }
        val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
        val renamed = augmented.withColumnRenamed("_1", "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
      renamed.show(false)
        return renamed
      }
      
      // Exiting paste mode, now interpreting.
      
      scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
      Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
      ++
      ||
      ++
      ++
      
      Executing - select * from spark_parquet
      Executing - select * from spark_orc
      Executing - select * from spark_avro
      Executing - select * from spark_text
      user.timezone - UTC
      TimeZone.getDefault - UTC
      spark.sql.session.timeZone - UTC
      +----------------------+-------------------+------------+                       
      |type                  |ts                 |millis      |
      +----------------------+-------------------+------------+
      |FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
      |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
      |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
      |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
      +----------------------+-------------------+------------+
      

      1. Set session timezone to America/Los_Angeles

      
      scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
      
      scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
      ++
      ||
      ++
      ++
      
      Executing - select * from spark_parquet
      Executing - select * from spark_orc
      Executing - select * from spark_avro
      Executing - select * from spark_text
      user.timezone - UTC
      TimeZone.getDefault - UTC
      spark.sql.session.timeZone - America/Los_Angeles
      +----------------------+-------------------+------------+
      |type                  |ts                 |millis      |
      +----------------------+-------------------+------------+
      |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
      |FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
      |FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
      |FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
      +----------------------+-------------------+------------+
      

      2. Started shell (JVM) in America/Los_Angeles timezone (which sets session timezone also to America/Los_Angeles)

      bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions=' -Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'
      
      scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
      Hive Session ID = 10ff355c-318d-4cb8-870f-a388652133b1
      ++
      ||
      ++
      ++
      
      Executing - select * from spark_parquet
      Executing - select * from spark_orc
      Executing - select * from spark_avro
      Executing - select * from spark_text
      user.timezone - America/Los_Angeles
      TimeZone.getDefault - America/Los_Angeles
      spark.sql.session.timeZone - America/Los_Angeles
      +----------------------+-------------------+------------+                       
      |type                  |ts                 |millis      |
      +----------------------+-------------------+------------+
      |FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
      |FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
      |FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599994123000|
      |FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
      +----------------------+-------------------+------------+
      

      As we can see in 1 and 2, parquet and other formats are behaving differently.

      In 1 - 1989-01-04 17:02:03|599965323000 seems correct according to America/Los_Angeles timezone as the original value inserted in UTC was 1989-01-05 01:02:03 which is equal to 599965323000 UTC

      In 2 - only parquet seems to be correct while the other formats are producing 1989-01-05 01:02:03|599994123000 which should not be the case if we are using a different timezone ( America/Los_Angeles ). I think they are coming from individual file format readers (avro, orc, text) but then they don't produce a converted value in America/Los_Angeles timezone. I saw orc reader adjusting offset according to writer and reader(JVM) timezone(probably avro is doing the same) but then we are not seeing the end value in spark according to spark.sql.session.timeZone

      Are there any guidelines/docs around how to use timezones with spark ?

      cc cloud_fan hyukjin.kwon dongjoon

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              ShubhamChaurasia Shubham Chaurasia
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: