Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 3.1.1
- Fix Version/s: None
- Component/s: None
Description
This issue occurs, for example, when trying to create a data frame from Python datetime objects that are "out of range", where the range is platform-dependent due to the use of time.mktime in TimestampType.toInternal:
import datetime
spark_session.createDataFrame([(datetime.datetime(9999, 12, 31, 0, 0),)])
A more direct way to reproduce the issue is by invoking TimestampType.toInternal directly:
import datetime
from pyspark.sql.types import TimestampType

dt = datetime.datetime(9999, 12, 31, 0, 0)
TimestampType().toInternal(dt)
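For context, the failing path can be reproduced without Spark. The sketch below mimics, in simplified form (it is not the exact pyspark source), the conversion TimestampType.toInternal performs: calendar.timegm for timezone-aware values and time.mktime for naive ones, the latter having a platform-dependent supported range:

```python
import calendar
import datetime
import time

def mktime_to_internal(dt):
    # Simplified sketch of the mktime/timegm-based conversion to
    # microseconds since the epoch; the naive-datetime branch goes
    # through time.mktime, whose range is platform-dependent.
    seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
               else time.mktime(dt.timetuple()))
    return int(seconds) * 10 ** 6 + dt.microsecond

# An in-range, timezone-aware value converts fine:
print(mktime_to_internal(
    datetime.datetime(1970, 1, 2, tzinfo=datetime.timezone.utc)))  # → 86400000000

# A naive value far outside the platform's mktime range may fail:
try:
    mktime_to_internal(datetime.datetime(9999, 12, 31, 0, 0))
except (OverflowError, ValueError) as e:
    print("out of range:", e)
```

Whether the second call fails (and with which exception) depends on the platform's C library, which is exactly the portability problem this issue describes.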
The suggested improvement is to avoid using time.mktime to increase the range of datetime values. A possible implementation may look as follows:
import datetime
import pytz

EPOCH_UTC = datetime.datetime(1970, 1, 1).replace(tzinfo=pytz.utc)
LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo

def toInternal(dt):
    if dt is not None:
        dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ)
        dt_utc = dt.astimezone(pytz.utc)
        td = dt_utc - EPOCH_UTC
        return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds
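The same arithmetic can be expressed with only the standard library (datetime.timezone.utc in place of pytz.utc). This is a sketch under the same assumptions as above, not pyspark code:

```python
import datetime

EPOCH_UTC = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo

def to_internal(dt):
    # Microseconds since the epoch via timedelta arithmetic; no
    # time.mktime, so the full datetime range (up to year 9999) works.
    if dt is not None:
        dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ)
        td = dt.astimezone(datetime.timezone.utc) - EPOCH_UTC
        return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds

# The year-9999 value that can overflow time.mktime converts cleanly:
print(to_internal(datetime.datetime(9999, 12, 31, tzinfo=datetime.timezone.utc)))
```

Note that the return expression is equivalent to flooring the timedelta to whole microseconds, i.e. td // datetime.timedelta(microseconds=1).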
This relies on the ability to derive the local timezone; mechanisms other than the one suggested above may be used to obtain it.
Test cases include:
dt1 = datetime.datetime(2021, 5, 25, 12, 23)
dt2 = dt1.replace(tzinfo=pytz.timezone('Europe/Zurich'))
dt3 = datetime.datetime(9999, 12, 31, 0, 0)
dt4 = dt3.replace(tzinfo=pytz.timezone('Europe/Zurich'))

toInternal(dt1) == TimestampType().toInternal(dt1)
toInternal(dt2) == TimestampType().toInternal(dt2)
toInternal(dt3)  # TimestampType().toInternal(dt3) fails
toInternal(dt4) == TimestampType().toInternal(dt4)