[SPARK-30951] Potential data loss for legacy applications after switch to proleptic Gregorian calendar - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 3.0.0
Component/s: SQL
Labels:
- release-notes

Description

tl;dr: We recently discovered some Spark 2.x sites that have lots of data containing dates before October 15, 1582. This could be an issue when such sites try to upgrade to Spark 3.0.

From SPARK-26651:

"The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian)"

We recently discovered that some large-scale Spark 2.x applications rely on dates before October 15, 1582.

Two cases came up recently:

- An application that uses a commercial third-party library to encode sensitive dates. On insert, the library encodes the actual date as some other date. On select, the library decodes the date back to the original date. The encoded value could be any date, including one before October 15, 1582 (e.g., "0602-04-04").
- An application that uses a specific unlikely date (e.g., "1200-01-01") as a marker to indicate "unknown date" (in lieu of null).

Both sites ran into problems after another component in their system was upgraded to use the proleptic Gregorian calendar. Spark applications that read files created by the upgraded component were interpreting encoded or marker dates incorrectly, and vice versa. Also, their data now had a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file used which calendar.

Both sites had enormous amounts of existing data, so re-encoding the dates using some other scheme was not a feasible solution.

This is relevant to Spark 3:

Any Spark 2 application that uses such date-encoding schemes may run into trouble when run on Spark 3. The application may not properly interpret the dates previously written by Spark 2. Also, once the Spark 3 version of the application writes data, the tables will have a mix of calendars (hybrid and proleptic Gregorian) with no metadata to indicate which file uses which calendar.

Similarly, sites might run with mixed Spark versions, resulting in data written by one version that cannot be interpreted by the other. And as above, the tables will now have a mix of calendars with no way to detect which file uses which calendar.

As with the two real-life example cases, these applications may have enormous amounts of legacy data, so re-encoding the dates using some other scheme may not be feasible.

We might want to consider a configuration setting to allow the user to specify the calendar for storing and retrieving date and timestamp values (not sure how such a flag would affect other date and timestamp-related functions). I realize the change is far bigger than just adding a configuration setting.
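One way such a setting could take effect on read is to rebase stored day counts from the hybrid calendar to the proleptic Gregorian calendar, so that the year/month/day the writer intended is preserved. The following is a minimal sketch of that idea using only JDK classes. `RebaseSketch` and `hybridToProlepticDays` are hypothetical names for illustration, not Spark APIs, and the sketch assumes AD dates stored as days since 1970-01-01.

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hypothetical helper for illustration only; not part of Spark.
object RebaseSketch {
  private val MillisPerDay = 86400000L

  // Reads `hybridDays` (days since 1970-01-01, counted in the hybrid
  // Julian + Gregorian calendar) as a year/month/day, then returns the day
  // count of the same year/month/day in the proleptic Gregorian calendar.
  // Assumption: AD dates only. Throws java.time.DateTimeException for dates
  // that exist only in the Julian calendar, such as 1500-02-29.
  def hybridToProlepticDays(hybridDays: Long): Long = {
    // java.util.GregorianCalendar is a hybrid calendar: Julian before
    // October 15, 1582 and Gregorian afterwards.
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.setTimeInMillis(Math.multiplyExact(hybridDays, MillisPerDay))
    // java.time.LocalDate uses the proleptic Gregorian (ISO) calendar.
    LocalDate.of(
      cal.get(Calendar.YEAR),
      cal.get(Calendar.MONTH) + 1, // Calendar months are zero-based
      cal.get(Calendar.DAY_OF_MONTH)
    ).toEpochDay
  }
}
```

Dates on or after October 15, 1582 pass through unchanged, so a setting built on this idea would only alter values in the affected range.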

Here's a quick example of where trouble may happen, using the real-life case of the marker date.

In Spark 2.4:

scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
res0: Long = 1
scala>

In Spark 3.0 (reading from the same legacy file):

scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
res0: Long = 0
scala>
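A plausible reading of the mismatch above is that the same label, "1200-01-01", maps to different day counts in the two calendars. The sketch below computes both counts with JDK classes only. `CalendarGap` and its methods are hypothetical names, and the sketch assumes dates are stored as days since 1970-01-01.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical helper for illustration only; not part of Spark.
object CalendarGap {
  private val MillisPerDay = 86400000L

  // Days since 1970-01-01 in the proleptic Gregorian calendar (java.time).
  def prolepticEpochDay(year: Int, month: Int, day: Int): Long =
    LocalDate.of(year, month, day).toEpochDay

  // Days since 1970-01-01 in the hybrid calendar (java.util), which treats
  // dates before October 15, 1582 as Julian. Assumes AD dates.
  def hybridEpochDay(year: Int, month: Int, day: Int): Long = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()                   // drop the current time-of-day fields
    cal.set(year, month - 1, day) // Calendar months are zero-based
    Math.floorDiv(cal.getTimeInMillis, MillisPerDay)
  }

  // Positive when the hybrid day count is later than the proleptic one.
  def gapInDays(year: Int, month: Int, day: Int): Long =
    hybridEpochDay(year, month, day) - prolepticEpochDay(year, month, day)
}
```

For the marker date, `CalendarGap.gapInDays(1200, 1, 1)` is 7, while for a modern date such as 2020-01-01 it is 0. A value written as "1200-01-01" under the hybrid calendar would therefore be read back under the proleptic Gregorian calendar as a date seven days later, and an equality filter on "1200-01-01" would no longer match it.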

By the way, Hive had a similar problem. Hive switched from hybrid calendar to proleptic Gregorian calendar between 2.x and 3.x. After some upgrade headaches related to dates before 1582, the Hive community made the following changes:

- When writing date or timestamp data to ORC, Parquet, and Avro files, Hive checks a configuration setting to determine which calendar to use.
- When writing date or timestamp data to ORC, Parquet, and Avro files, Hive stores the calendar type in the metadata.
- When reading date or timestamp data from ORC, Parquet, and Avro files, Hive checks the metadata for the calendar type.
- When reading date or timestamp data from ORC, Parquet, and Avro files that lack calendar metadata, Hive's behavior is determined by a configuration setting. This allows Hive to read legacy data (note: if the data already consists of a mix of calendar types with no metadata, there is no good solution).
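The read-side rules described above can be sketched as a small resolution function: use the calendar recorded in the file metadata when present, otherwise fall back to a configured default. This is an illustration of the decision logic only. The metadata key, the value strings, and all type names are hypothetical and do not reflect Hive's or Spark's actual identifiers.

```scala
// Hypothetical types and names for illustration only.
sealed trait CalendarType
case object HybridCalendar extends CalendarType
case object ProlepticGregorianCalendar extends CalendarType

object CalendarResolution {
  // Assumed metadata key; real writers may use a different key or encoding.
  val MetadataKey = "writer.calendar"

  // Write side: record the calendar used, so readers need not guess.
  def metadataFor(calendar: CalendarType): Map[String, String] = calendar match {
    case HybridCalendar             => Map(MetadataKey -> "hybrid")
    case ProlepticGregorianCalendar => Map(MetadataKey -> "proleptic")
  }

  // Read side: prefer file metadata; for legacy files without it (or with an
  // unrecognized value), use the default taken from configuration.
  def resolve(fileMetadata: Map[String, String],
              legacyDefault: CalendarType): CalendarType =
    fileMetadata.get(MetadataKey) match {
      case Some("hybrid")    => HybridCalendar
      case Some("proleptic") => ProlepticGregorianCalendar
      case _                 => legacyDefault
    }
}
```

The fallback only helps when all unlabeled files share one calendar. As the note above says, once unlabeled files mix calendars, no single default can be correct for all of them.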

Issue Links

is related to
- SPARK-31404 file source backward compatibility after calendar switch (Resolved)
- SPARK-26651 Use Proleptic Gregorian calendar (Resolved)

supercedes
- SPARK-34675 TimeZone inconsistencies when JVM and session timezones are different (Reopened)

Sub-Tasks

1. Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time (Resolved, Max Gekk)
2. Incompatible Avro dates/timestamps with Spark 2.4 (Resolved, Max Gekk)
3. Incompatible Parquet dates/timestamps with Spark 2.4 (Resolved, Max Gekk)
4. Incompatible ORC dates with Spark 2.4 (Resolved, Max Gekk)
5. Benchmark date-time rebasing in Parquet datasource (Resolved, Max Gekk)
6. Benchmark date-time rebasing in ORC datasource (Resolved, Max Gekk)
7. Split Parquet/Avro configs for rebasing dates/timestamps in read and in write (Resolved, Max Gekk)
8. Incorrect timestamps rebasing on autumn daylight saving time (Resolved, Max Gekk)
9. Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582 (Resolved, Unassigned)

People

Assignee: Max Gekk
Reporter: Bruce Robbins
Votes: 0
Watchers: 5

Dates

Created: 26/Feb/20 01:43
Updated: 12/Dec/22 18:10
Resolved: 20/Mar/20 07:33
