Spark / SPARK-35515

TimestampType: OverflowError: mktime argument out of range

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      This issue occurs, for example, when creating a DataFrame from Python datetime objects that are out of range. What counts as out of range is platform-dependent, because TimestampType.toInternal uses time.mktime:

      import datetime
      from pyspark.sql import SparkSession
      spark_session = SparkSession.builder.getOrCreate()
      spark_session.createDataFrame([(datetime.datetime(9999, 12, 31, 0, 0),)])
      

      A more direct way to reproduce the issue is to call TimestampType.toInternal:

      import datetime
      from pyspark.sql.types import TimestampType
      dt = datetime.datetime(9999, 12, 31, 0, 0)
      TimestampType().toInternal(dt)
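
The platform dependence can be observed without Spark. The following is a minimal sketch, assuming that naive datetimes reach time.mktime via dt.timetuple(); the helper name mktime_seconds is hypothetical and only serves the illustration. Whether the second call returns a value or None depends on the platform's C library.

```python
import datetime
import time


def mktime_seconds(dt):
    """Return local-time epoch seconds via time.mktime, or None if the
    platform cannot represent the value."""
    try:
        return time.mktime(dt.timetuple())
    except (OverflowError, ValueError):
        # time.mktime documents both exception types for unrepresentable times.
        return None


# A contemporary date is representable everywhere.
print(mktime_seconds(datetime.datetime(2021, 5, 25, 12, 23)))
# A far-future date may or may not be, depending on the platform.
print(mktime_seconds(datetime.datetime(9999, 12, 31, 0, 0)))
```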
      

      The suggested improvement is to stop using time.mktime, which would extend the range of supported datetime values. A possible implementation may look as follows:

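      import datetime
      import pytz

      # Epoch as a timezone-aware UTC datetime.
      EPOCH_UTC = datetime.datetime(1970, 1, 1).replace(tzinfo=pytz.utc)
      # Local timezone, derived once from the current time (a fixed UTC offset).
      LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo

      def toInternal(dt):
          if dt is not None:
              # Treat naive datetimes as local time.
              dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ)
              dt_utc = dt.astimezone(pytz.utc)
              td = dt_utc - EPOCH_UTC
              # Microseconds since the epoch, using integer arithmetic only.
              return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds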
      

      This relies on the ability to derive the local timezone. Mechanisms other than the one suggested above may be used for that.
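
The same idea can also be sketched with the standard library only, without pytz. This is an illustration under stated assumptions, not a proposed patch: the names to_internal_stdlib and local_tz are hypothetical, and the optional local_tz parameter exists only so that the local timezone can be injected. Note that datetime.datetime.now().astimezone().tzinfo is a fixed-offset timezone reflecting the current UTC offset, so naive datetimes on the other side of a daylight-saving transition may differ from the time.mktime result by the DST shift.

```python
import datetime

# Epoch as a timezone-aware UTC datetime.
EPOCH_UTC = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
ONE_MICROSECOND = datetime.timedelta(microseconds=1)


def to_internal_stdlib(dt, local_tz=None):
    """Convert a datetime to microseconds since the epoch without time.mktime."""
    if dt is None:
        return None
    if dt.tzinfo is None:
        if local_tz is None:
            # Assumption: the current local UTC offset applies to dt as well.
            local_tz = datetime.datetime.now().astimezone().tzinfo
        dt = dt.replace(tzinfo=local_tz)
    # Subtracting aware datetimes accounts for their UTC offsets; floor
    # division by one microsecond yields an exact integer.
    return (dt - EPOCH_UTC) // ONE_MICROSECOND


utc = datetime.timezone.utc
print(to_internal_stdlib(datetime.datetime(1970, 1, 1, tzinfo=utc)))    # 0
print(to_internal_stdlib(datetime.datetime(9999, 12, 31, tzinfo=utc)))  # 253402214400000000
```

Subtracting the aware datetimes directly, rather than first converting with astimezone, keeps the intermediate value inside the datetime range even for inputs close to datetime.max.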

      Test cases include:

      dt1 = datetime.datetime(2021, 5, 25, 12, 23)
      dt2 = dt1.replace(tzinfo=pytz.timezone('Europe/Zurich'))
      dt3 = datetime.datetime(9999, 12, 31, 0, 0)
      dt4 = dt3.replace(tzinfo=pytz.timezone('Europe/Zurich'))
      
      toInternal(dt1) == TimestampType().toInternal(dt1)
      toInternal(dt2) == TimestampType().toInternal(dt2)
      toInternal(dt3)  # TimestampType().toInternal(dt3) raises OverflowError on affected platforms
      toInternal(dt4) == TimestampType().toInternal(dt4)
      

          People

            Assignee: Unassigned
            Reporter: Martin Studer (mstuder)
