
SPARK-19561: PySpark DataFrames don't allow timestamps near epoch

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1, 2.1.0
    • Fix Version/s: 2.1.1, 2.2.0
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

      PySpark does not allow timestamps at or near the epoch to be created in a DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299

      TimestampType.toInternal converts a datetime object to a number representing microseconds since the epoch. For all times more than about 2148 seconds before or after 1970-01-01T00:00:00+0000, the magnitude of this number exceeds 2^31, and Py4J automatically serializes it as a long.
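
      For reference, the conversion looks roughly like the sketch below (a simplified stand-in, not the exact library source); the helper name to_internal and the example timestamp are illustrative only:

      import calendar
      import time
      from datetime import datetime, timezone

      def to_internal(dt):
          # Simplified version of what TimestampType.toInternal does:
          # convert a datetime to microseconds since the Unix epoch.
          if dt is None:
              return None
          seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                     else time.mktime(dt.timetuple()))
          return int(seconds) * 1000000 + dt.microsecond

      # 1970-01-01T00:30:00 UTC is 1,800,000,000 microseconds after the epoch,
      # which fits in a signed 32-bit int (max 2,147,483,647).
      print(to_internal(datetime(1970, 1, 1, 0, 30, tzinfo=timezone.utc)))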

      However, for times within this range (roughly 35 minutes on either side of the epoch), Py4J serializes the value as an int. When the object is created on the Scala side, the int is not recognized and the value becomes null. This leads to null values in non-nullable fields and corrupted Parquet files.
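
      A minimal reproduction on an affected version (2.0.1 or 2.1.0) might look like the following; note that naive datetimes are interpreted in the local timezone, so this assumes a timezone close to UTC:

      from datetime import datetime
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # ~30 minutes after the epoch: the internal value (1.8e9 microseconds)
      # fits in 32 bits, so Py4J sends it as a Java int and the timestamp
      # comes back null on affected versions.
      near_epoch = spark.createDataFrame([(datetime(1970, 1, 1, 0, 30),)], ["ts"])
      near_epoch.show()

      # A timestamp far from the epoch exceeds 2^31 microseconds, is sent as a
      # Java long, and round-trips correctly.
      far = spark.createDataFrame([(datetime(2017, 2, 10, 12, 0),)], ["ts"])
      far.show()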

      The solution is trivial: force TimestampType.toInternal to always return a long.
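
      A sketch of that idea under Python 2 semantics (where int and long are distinct types, and Py4J maps long to a Java long); this illustrates the proposal rather than quoting the committed patch:

      import calendar
      import time

      def to_internal_fixed(dt):
          # Same conversion as before, but the result is explicitly a Python 2
          # 'long', so Py4J always serializes it as a Java long, never an int.
          if dt is None:
              return None
          seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                     else time.mktime(dt.timetuple()))
          return long(int(seconds) * 1000000 + dt.microsecond)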


    People

    • Assignee: Jason White (jason.white)
    • Reporter: Jason White (jason.white)
    • Votes: 0
    • Watchers: 2
