Details
- Type: Bug
- Status: Reopened
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.4.7
- Fix Version/s: None
- Component/s: None
Description
Inserted the following data with UTC as both the JVM and the session timezone.
Spark-shell launch command
bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions=' -Duser.timezone=UTC' --conf spark.executor.extraJavaOptions='-Duser.timezone=UTC'
Table creation
sql("use ts").show sql("create table spark_parquet(type string, t timestamp) stored as parquet").show sql("create table spark_orc(type string, t timestamp) stored as orc").show sql("create table spark_avro(type string, t timestamp) stored as avro").show sql("create table spark_text(type string, t timestamp) stored as textfile").show sql("insert into spark_parquet values ('FROM SPARK-EXT PARQUET', '1989-01-05 01:02:03')").show sql("insert into spark_orc values ('FROM SPARK-EXT ORC', '1989-01-05 01:02:03')").show sql("insert into spark_avro values ('FROM SPARK-EXT AVRO', '1989-01-05 01:02:03')").show sql("insert into spark_text values ('FROM SPARK-EXT TEXT', '1989-01-05 01:02:03')").show
Used the following function to check and verify the returned timestamps:
scala> :paste
// Entering paste mode (ctrl-D to finish)

def showTs(db: String, tables: String*): org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = {
  sql("use " + db).show
  import scala.collection.mutable.ListBuffer
  var results = new ListBuffer[org.apache.spark.sql.DataFrame]()
  for (tbl <- tables) {
    val query = "select * from " + tbl
    println("Executing - " + query)
    results += sql(query)
  }
  println("user.timezone - " + System.getProperty("user.timezone"))
  println("TimeZone.getDefault - " + java.util.TimeZone.getDefault.getID)
  println("spark.sql.session.timeZone - " + spark.conf.get("spark.sql.session.timeZone"))
  var unionDf = results(0)
  for (i <- 1 until results.length) {
    unionDf = unionDf.unionAll(results(i))
  }
  val augmented = unionDf.map(r => (r.getString(0), r.getTimestamp(1), r.getTimestamp(1).getTime))
  val renamed = augmented.withColumnRenamed("_1", "type").withColumnRenamed("_2", "ts").withColumnRenamed("_3", "millis")
  renamed.show(false)
  return renamed
}

// Exiting paste mode, now interpreting.

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
Hive Session ID = daa82b83-b50d-4038-97ee-1ecb2d01b368
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - UTC
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599965323000|
+----------------------+-------------------+------------+
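As a quick sanity check (my own sketch, not part of the original session), the millis value above can be reproduced with plain java.time by interpreting the inserted literal as a UTC wall-clock time:

import java.time.{LocalDateTime, ZoneOffset}

// '1989-01-05 01:02:03' interpreted as a UTC wall-clock time
val utcMillis = LocalDateTime.of(1989, 1, 5, 1, 2, 3)
  .toInstant(ZoneOffset.UTC)
  .toEpochMilli
// utcMillis == 599965323000L, matching the millis column above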
1. Set session timezone to America/Los_Angeles
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - UTC
TimeZone.getDefault - UTC
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT AVRO   |1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT TEXT   |1989-01-04 17:02:03|599965323000|
+----------------------+-------------------+------------+
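A changed ts column with an unchanged millis column is what re-rendering the same instant in the new session timezone should give. A minimal java.time sketch (not from the original session) to illustrate:

import java.time.{Instant, ZoneId}

// The stored instant, rendered in the America/Los_Angeles session timezone
val laWallClock = Instant.ofEpochMilli(599965323000L)
  .atZone(ZoneId.of("America/Los_Angeles"))
// 1989-01-04T17:02:03-08:00 -- same instant, only the displayed wall-clock value changes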
2. Started the shell (JVM) in the America/Los_Angeles timezone (which also sets the session timezone to America/Los_Angeles)
bin/spark-shell --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.catalogImplementation=hive --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 --conf spark.driver.extraJavaOptions=' -Duser.timezone=America/Los_Angeles' --conf spark.executor.extraJavaOptions='-Duser.timezone=America/Los_Angeles'

scala> showTs("ts", "spark_parquet", "spark_orc", "spark_avro", "spark_text")
Hive Session ID = 10ff355c-318d-4cb8-870f-a388652133b1
++
||
++
++
Executing - select * from spark_parquet
Executing - select * from spark_orc
Executing - select * from spark_avro
Executing - select * from spark_text
user.timezone - America/Los_Angeles
TimeZone.getDefault - America/Los_Angeles
spark.sql.session.timeZone - America/Los_Angeles
+----------------------+-------------------+------------+
|type                  |ts                 |millis      |
+----------------------+-------------------+------------+
|FROM SPARK-EXT PARQUET|1989-01-04 17:02:03|599965323000|
|FROM SPARK-EXT ORC    |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT AVRO   |1989-01-05 01:02:03|599994123000|
|FROM SPARK-EXT TEXT   |1989-01-05 01:02:03|599994123000|
+----------------------+-------------------+------------+
As seen in 1 and 2, Parquet and the other formats behave differently.
In 1, 1989-01-04 17:02:03 | 599965323000 looks correct for the America/Los_Angeles session timezone: the value was originally inserted as 1989-01-05 01:02:03 UTC, which is 599965323000 ms since the epoch, so only the displayed wall-clock value changes.
In 2, only Parquet looks correct, while the other formats return 1989-01-05 01:02:03 | 599994123000, which should not happen when the session timezone is America/Los_Angeles. The values appear to come from the individual file-format readers (Avro, ORC, text), but those readers do not produce a value converted to the America/Los_Angeles timezone. I saw the ORC reader adjusting the offset according to the writer and reader (JVM) timezones (Avro is probably doing the same), yet the final value in Spark does not reflect spark.sql.session.timeZone (see the arithmetic check below).
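For what it's worth, the gap between the two millis values is exactly the America/Los_Angeles offset for that date, which is consistent with the non-Parquet readers shifting the stored instant by the JVM timezone offset rather than only re-rendering it. A small arithmetic check (my own sketch, not from the original session):

// Difference between the millis values observed in 2 vs. 1
val deltaMillis = 599994123000L - 599965323000L   // 28800000 ms
val deltaHours  = deltaMillis / (60 * 60 * 1000)  // 8 hours = the UTC-8 (PST) offset on 1989-01-05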
Are there any guidelines/docs on how to use timezones with Spark?
Issue Links
is superseded by:
- SPARK-30951 Potential data loss for legacy applications after switch to proleptic Gregorian calendar (Resolved)
- SPARK-31404 file source backward compatibility after calendar switch (Resolved)
- SPARK-26651 Use Proleptic Gregorian calendar (Resolved)