Spark / SPARK-26887

Create datetime.date directly instead of creating datetime64[ns] as intermediate data.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels: None

    Description

      Currently, DataFrame.toPandas() with Arrow enabled, or ArrowStreamPandasSerializer for pandas UDFs, when used with pyarrow<0.12, creates a datetime64[ns] series as intermediate data and then converts it to a datetime.date series. The intermediate datetime64[ns] can overflow even when the date itself is valid, because datetime64[ns] can only represent timestamps up to 2262-04-11.

      >>> import datetime
      >>>
      >>> t = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
      >>>
      >>> df = spark.createDataFrame(t, 'date')
      >>> df.show()
      +----------+
      |     value|
      +----------+
      |2262-04-12|
      |2263-04-12|
      +----------+
      
      >>>
      >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      >>>
      >>> df.toPandas()
              value
      0  1677-09-21
      1  1678-09-21
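
The wrong years above are consistent with a signed 64-bit nanosecond counter wrapping around. The following is a minimal pure-Python sketch of that arithmetic, not the actual pandas or pyarrow code path; the helper names `wrap_int64` and `date_via_int64_ns` are hypothetical and exist only for illustration.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
NS_PER_DAY = 86_400 * 10**9  # nanoseconds in one day


def wrap_int64(n):
    """Reduce an arbitrary Python int to the signed 64-bit range (two's complement)."""
    return (n + 2**63) % 2**64 - 2**63


def date_via_int64_ns(d):
    """Model a date -> int64 nanoseconds -> date round trip (hypothetical helper)."""
    days = (d - EPOCH).days
    # Python ints never overflow, so the int64 wraparound is applied explicitly.
    ns = wrap_int64(days * NS_PER_DAY)
    # Floor division maps the wrapped nanosecond count back to a whole day.
    return EPOCH + datetime.timedelta(days=ns // NS_PER_DAY)


for d in [datetime.date(2262, 4, 11), datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]:
    print(d, "->", date_via_int64_ns(d))
```

Under this model, 2262-04-11 survives the round trip, while 2262-04-12 and 2263-04-12 come back as 1677-09-21 and 1678-09-21, matching the toPandas() output in the report.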
      

      We should avoid creating such intermediate data and create datetime.date series directly instead.
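
As a rough illustration of the direct approach: Arrow's date32 type stores a date as a count of days since the UNIX epoch, so each value can be turned into a datetime.date without passing through nanoseconds. The sketch below assumes the day counts have already been extracted into a plain Python list; `days_to_dates` is a hypothetical helper, not the function used in the actual fix.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)


def days_to_dates(days_since_epoch):
    """Convert day offsets from the UNIX epoch to datetime.date objects.

    None entries (nulls) are passed through unchanged. No nanosecond
    intermediate is created, so the datetime64[ns] range does not apply.
    """
    return [
        None if d is None else EPOCH + datetime.timedelta(days=d)
        for d in days_since_epoch
    ]


# Day offsets for the two dates used in the report, plus a null.
offsets = [
    (datetime.date(2262, 4, 12) - EPOCH).days,
    (datetime.date(2263, 4, 12) - EPOCH).days,
    None,
]
print(days_to_dates(offsets))
```

With newer pyarrow releases, the `date_as_object` option of `to_pandas` serves a similar purpose; whether it applies depends on the installed pyarrow version.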

            People

              Assignee: Takuya Ueshin (ueshin)
              Reporter: Takuya Ueshin (ueshin)
              Votes: 0
              Watchers: 2
