Status: Resolved
Resolution: Not A Problem
There is an issue with how timestamps are displayed/converted to Strings in Spark SQL. The documentation states that a timestamp should be created in the GMT time zone; however, when we do so, the printed value is offset by -8 hours:
new Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT]").toInstant.toEpochMilli)
res144: java.sql.Timestamp = 2014-12-31 16:00:00.0
new Timestamp(ZonedDateTime.parse("2015-01-01T00:00:00Z[GMT-08:00]").toInstant.toEpochMilli)
res145: java.sql.Timestamp = 2015-01-01 00:00:00.0
This result is confusing and unintuitive, and it introduces issues when converting DataFrames containing timestamps to RDDs that are then saved as text: the dates in the dataset are effectively shifted by 1 day.
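For context on where the -8 hour offset comes from: java.sql.Timestamp stores an instant, and its toString renders that instant in the JVM default time zone. The sketch below is a minimal illustration of that behavior, not Spark code. The names DefaultZoneDemo and render are hypothetical, and the zone "GMT-08:00" is assumed only because it matches the offset seen above.

```scala
import java.sql.Timestamp
import java.time.Instant
import java.util.TimeZone

object DefaultZoneDemo {
  // Hypothetical helper: print one Timestamp under a chosen JVM default zone.
  // It temporarily changes global state, so it is for demonstration only.
  def render(ts: Timestamp, zoneId: String): String = {
    val saved = TimeZone.getDefault
    try {
      TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
      ts.toString // formatted in the current default time zone
    } finally {
      TimeZone.setDefault(saved) // always restore the original default
    }
  }

  // Midnight GMT on 2015-01-01, the same instant as in the report.
  val ts: Timestamp = Timestamp.from(Instant.parse("2015-01-01T00:00:00Z"))

  val inGmt: String = render(ts, "GMT")       // 2015-01-01 00:00:00.0
  val inPst: String = render(ts, "GMT-08:00") // 2014-12-31 16:00:00.0
}
```

The underlying epoch value is identical in both cases; only the string rendering differs.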
The suggested fix is to update the timestamp toString representation to either a) include the time zone or b) display correctly in GMT.
This change may well introduce substantial and insidious bugs, so I'm not sure how best to resolve this.
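Both options can be approximated on the caller side without changing toString, by formatting the timestamp with an explicit zone before saving it as text. This is a minimal sketch under that assumption, not a proposed patch. ExplicitZoneFormat and toGmtString are hypothetical names, and the output pattern is just one possible choice.

```scala
import java.sql.Timestamp
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

object ExplicitZoneFormat {
  // Formatter pinned to UTC, so the result does not depend on the JVM default zone.
  // "xxx" prints the numeric offset, e.g. +00:00.
  private val gmtFormat: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss xxx").withZone(ZoneOffset.UTC)

  // Hypothetical helper: render a Timestamp in GMT with the offset included.
  def toGmtString(ts: Timestamp): String = gmtFormat.format(ts.toInstant)
}
```

For example, a timestamp holding the instant 2015-01-01T00:00:00Z formats as "2015-01-01 00:00:00 +00:00" regardless of where the JVM is running.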