[SPARK-34675] TimeZone inconsistencies when JVM and session timezones are different - ASF JIRA

Details

Type: Bug
Status: Reopened
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.7
Fix Version/s: None
Component/s: Spark Core
Labels:
None

Description

Inserted following data with UTC as both JVM and session timezone.

Spark-shell launch command

bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions=' -Duser.timezone=UTC' --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'

Table creation

sql("use ts").show
sql("create table spark_parquet(type string, t timestamp) stored as parquet").show
sql("create table spark_orc(type string, t timestamp) stored as orc").show
sql("create table spark_avro(type string, t timestamp) stored as avro").show
sql("create table spark_text(type string, t timestamp) stored as textfile").show
sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show
sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show
sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show
sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show

Used following function to check and verify the returned timestamps

scala> :paste
// Entering paste mode (ctrl-D to finish)

def showTs(
    db: String,
    tables: String*
): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
  sql("use " + db).show
  import scala.collection.mutable.ListBuffer
  var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
  for (tbl <- tables) {
    val query = "select * from " + tbl
    println("Executing - " + query);
    results += sql(query)
  }
  println("user.timezone - " + System.getProperty("user.timezone"))
  println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
  println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
  var unionDf = results(0)
  for (i <- 1 until results.length) {
    unionDf = unionDf.unionAll(results(i))
  }
  val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
  val renamed = augmented.withColumnRenamed("_1", "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
renamed.show(false)
  return renamed
}

// Exiting paste mode, now interpreting.

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - UTC
+----------------------+-------------------+------------+                       
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
+----------------------+-------------------+------------+

1. Set session timezone to America/Los_Angeles


scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
+----------------------+-------------------+------------+

2. Started shell (JVM) in America/Los_Angeles timezone (which sets session timezone also to America/Los_Angeles)

bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions=' -Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
Hive Session ID = 10ff355c-318d-4cb8-870f-a388652133b1
++
||
++
++

Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+                       
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
+----------------------+-------------------+------------+

As we can see in 1 and 2, parquet and other formats are behaving differently.

In 1 - 1989-01-04 17:02:03|599965323000 seems correct according to America/Los_Angeles timezone as the original value inserted in UTC was 1989-01-05 01:02:03 which is equal to 599965323000 UTC

In 2 - only parquet seems to be correct while the other formats are producing 1989-01-05 01:02:03|599994123000 which should not be the case if we are using a different timezone ( America/Los_Angeles ). I think they are coming from individual file format readers (avro, orc, text) but then they don't produce a converted value in America/Los_Angeles timezone. I saw orc reader adjusting offset according to writer and reader(JVM) timezone(probably avro is doing the same) but then we are not seeing the end value in spark according to spark.sql.session.timeZone

Are there any guidelines/docs around how to use timezones with spark ?

cc cloud_fan hyukjin.kwon dongjoon

Attachments

Issue Links

is superceded by

SPARK-30951 Potential data loss for legacy applications after switch to proleptic Gregorian calendar

Resolved

SPARK-31404 file source backward compatibility after calendar switch

Resolved

SPARK-26651 Use Proleptic Gregorian calendar

Resolved

TimeZone inconsistencies when JVM and session timezones are different

Details

Description

Attachments

Issue Links

Activity

People

Dates