tl;dr: We recently discovered some Spark 2.x sites that have lots of data containing dates before October 15, 1582. This could be an issue when such sites try to upgrade to Spark 3.0.
"The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian)."
We recently discovered that some large scale Spark 2.x applications rely on dates before October 15, 1582.
Two cases came up recently:
- An application that uses a commercial third-party library to encode sensitive dates. On insert, the library encodes the actual date as some other date. On select, the library decodes the date back to the original date. The encoded value could be any date, including one before October 15, 1582 (e.g., "0602-04-04").
- An application that uses a specific unlikely date (e.g., "1200-01-01") as a marker to indicate "unknown date" (in lieu of null).
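The marker-date scheme in the second case can be sketched in a few lines (the names and marker value here are illustrative, not the site's actual code):

```python
from datetime import date

# Illustrative sketch of the second case: a fixed, "unlikely" date stands in
# for an unknown date because the pipeline does not use null.
MARKER = date(1200, 1, 1)

def encode(d):
    """Replace None with the marker date before writing."""
    return MARKER if d is None else d

def decode(d):
    """Map the marker date back to None after reading."""
    return None if d == MARKER else d

# The round trip works only as long as writer and reader agree on the calendar:
assert decode(encode(None)) is None
assert decode(encode(date(1984, 5, 1))) == date(1984, 5, 1)
```

If a calendar change shifts the stored marker by even one day on read, decode() no longer recognizes it, and "unknown" dates leak through as real dates.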
Both sites ran into problems after another component in their system was upgraded to use the proleptic Gregorian calendar. Spark applications that read files created by the upgraded component were interpreting encoded or marker dates incorrectly, and vice versa. Also, their data now had a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file used which calendar.
Both sites had enormous amounts of existing data, so re-encoding the dates using some other scheme was not a feasible solution.
This is relevant to Spark 3:
Any Spark 2 application that uses such date-encoding schemes may run into trouble on Spark 3. The application may not properly interpret the dates previously written by Spark 2. Also, once the Spark 3 version of the application writes data, the tables will have a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file uses which calendar.
Similarly, sites might run with mixed Spark versions, resulting in data written by one version that cannot be interpreted by the other. And as above, the tables will now have a mix of calendars with no way to detect which file uses which calendar.
As with the two real-life example cases, these applications may have enormous amounts of legacy data, so re-encoding the dates using some other scheme may not be feasible.
We might want to consider a configuration setting to allow the user to specify the calendar for storing and retrieving date and timestamp values (not sure how such a flag would affect other date and timestamp-related functions). I realize the change is far bigger than just adding a configuration setting.
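A hypothetical shape for such a setting in PySpark; the property name `spark.sql.datetime.calendar` and its values are invented here purely for illustration and do not exist in Spark:

```python
# Hypothetical configuration sketch -- this property name is invented for
# illustration; no such setting exists in Spark at the time of writing.
spark.conf.set("spark.sql.datetime.calendar", "hybrid")     # store/read dates using the hybrid calendar
spark.conf.set("spark.sql.datetime.calendar", "proleptic")  # store/read dates using proleptic Gregorian
```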
Here's a quick example of where trouble may happen, using the real-life case of the marker date.
In Spark 2.4:
In Spark 3.0 (reading from the same legacy file):
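The day-shift behind that example can be reproduced without Spark. The following pure-stdlib sketch encodes the marker date with Julian calendar rules (which the hybrid calendar applies to dates before October 15, 1582) and then decodes the resulting day number with proleptic Gregorian rules, mimicking a Spark 2.x writer and a Spark 3.0 reader:

```python
# Sketch: why the same stored integer decodes to two different dates.
# A hybrid-calendar writer (Spark 2.x style) stores pre-1582-10-15 dates using
# Julian rules; a proleptic-Gregorian reader (Spark 3.0 style) decodes the
# same integer with Gregorian rules. Pure stdlib; not Spark itself.
from datetime import date

def julian_to_jdn(year, month, day):
    """Julian Day Number of a date expressed in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def jdn_to_proleptic_gregorian(jdn):
    """Decode a Julian Day Number using proleptic Gregorian rules."""
    # datetime.date is proleptic Gregorian; Gregorian 0001-01-01 has JDN 1721426
    # and ordinal 1, so the offset between the two day counts is 1721425.
    return date.fromordinal(jdn - 1721425)

# The writer stores the marker date 1200-01-01 under the hybrid calendar.
stored = julian_to_jdn(1200, 1, 1)
# The upgraded reader decodes the same integer with proleptic Gregorian rules.
decoded = jdn_to_proleptic_gregorian(stored)
print(decoded)                       # 1200-01-08: the marker silently shifts by 7 days
print(decoded == date(1200, 1, 1))   # False: "unknown date" checks no longer match
```

Seven days is the Julian/Gregorian gap in effect for January 1200; the gap varies by century, so different encoded dates shift by different amounts.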
By the way, Hive had a similar problem. Hive switched from the hybrid calendar to the proleptic Gregorian calendar between 2.x and 3.x. After some upgrade headaches related to dates before October 15, 1582, the Hive community made the following changes:
- When writing date or timestamp data to ORC, Parquet, and Avro files, Hive checks a configuration setting to determine which calendar to use.
- When writing date or timestamp data to ORC, Parquet, and Avro files, Hive stores the calendar type in the metadata.
- When reading date or timestamp data from ORC, Parquet, and Avro files, Hive checks the metadata for the calendar type.
- When reading date or timestamp data from ORC, Parquet, and Avro files that lack calendar metadata, Hive's behavior is determined by a configuration setting. This allows Hive to read legacy data (note: if the data already consists of a mix of calendar types with no metadata, there is no good solution).
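The read-side policy in the last two bullets can be sketched as follows (the metadata key and calendar labels are illustrative, not Hive's actual property names):

```python
# Illustrative sketch of the Hive-style read policy: prefer per-file calendar
# metadata, and fall back to a site-wide configuration for legacy files.
# The key name "writer.calendar" and the values are invented for illustration.
def calendar_for_read(file_metadata, legacy_default):
    """Return the calendar to use when interpreting dates read from a file."""
    # Files written after the upgrade declare their calendar in metadata.
    if "writer.calendar" in file_metadata:
        return file_metadata["writer.calendar"]
    # Legacy files lack the metadata; a configuration setting decides.
    return legacy_default

# A new file is interpreted with the calendar it was written with:
assert calendar_for_read({"writer.calendar": "proleptic"}, "hybrid") == "proleptic"
# A legacy file without metadata falls back to the configured default:
assert calendar_for_read({}, "hybrid") == "hybrid"
```

As the note above says, this fallback is sound only when all legacy files share one calendar; a pre-existing mix with no metadata cannot be disambiguated after the fact.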