Spark / SPARK-26887

Create datetime.date directly instead of creating datetime64[ns] as intermediate data.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      Currently DataFrame.toPandas() with Arrow enabled, or ArrowStreamPandasSerializer for pandas UDFs with pyarrow<0.12, creates a datetime64[ns] series as intermediate data and then converts it to a datetime.date series. datetime64[ns] can only represent dates from 1677-09-21 through 2262-04-11, so the intermediate series can overflow even when the date itself is valid.

      >>> import datetime
      >>>
      >>> t  = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
      >>>
      >>> df = spark.createDataFrame(t, 'date')
      >>> df.show()
      +----------+
      |     value|
      +----------+
      |2262-04-12|
      |2263-04-12|
      +----------+
      
      >>>
      >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      >>>
      >>> df.toPandas()
              value
      0  1677-09-21
      1  1678-09-21
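
The wrapped values above are consistent with signed 64-bit overflow of a nanosecond count. The following is a minimal sketch of that arithmetic in plain Python, not the actual pandas or pyarrow code path; the helper names `wrap_int64` and `date_via_int64_nanoseconds` are hypothetical, and the wraparound behaviour is an assumption inferred from the output shown in this report.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
NS_PER_DAY = 86_400 * 10**9  # nanoseconds in one day


def wrap_int64(n):
    """Reduce n to the signed 64-bit range, emulating two's-complement overflow."""
    return (n + 2**63) % 2**64 - 2**63


def date_via_int64_nanoseconds(d):
    """Round-trip a date through an int64 count of nanoseconds since the epoch.

    Assumption: this mimics what an intermediate datetime64[ns] value does
    when the nanosecond count does not fit in 64 bits.
    """
    days = (d - EPOCH).days
    ns = wrap_int64(days * NS_PER_DAY)
    # Floor division keeps negative (pre-1970) values on the correct day.
    return EPOCH + datetime.timedelta(days=ns // NS_PER_DAY)


for d in (datetime.date(2262, 4, 11), datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)):
    print(d, "->", date_via_int64_nanoseconds(d))
```

Under these assumptions, 2262-04-11 survives the round trip, while the two dates from the reproduction come back as the same 1677 and 1678 dates that toPandas() returned.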
      

      We should avoid creating such intermediate data and create datetime.date series directly instead.
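
One way to picture the direct conversion: Arrow's date32 type stores each date as a count of days since 1970-01-01, so every value can be mapped straight to a datetime.date without passing through nanoseconds. The sketch below illustrates the idea using only the standard library; it is not the patch that resolved this issue, and `days_since_epoch_to_dates` is a hypothetical helper name.

```python
import datetime

# Proleptic Gregorian ordinal of the UNIX epoch (1970-01-01).
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()


def days_since_epoch_to_dates(days):
    """Convert day counts since 1970-01-01 directly to datetime.date objects.

    None entries (nulls) are passed through unchanged. No nanosecond
    intermediate is created, so the only limit is datetime.date's own
    range (years 1 through 9999).
    """
    return [
        None if d is None else datetime.date.fromordinal(EPOCH_ORDINAL + d)
        for d in days
    ]


# Day counts for the two dates used in the reproduction.
inputs = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
day_counts = [d.toordinal() - EPOCH_ORDINAL for d in inputs]
print(days_since_epoch_to_dates(day_counts))
```

Because the conversion works on whole days, dates beyond 2262-04-11 come back unchanged.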

              People

              • Assignee: Takuya Ueshin (ueshin)
              • Reporter: Takuya Ueshin (ueshin)
              • Votes: 0
              • Watchers: 2
