Description
Currently, DataFrame.toPandas() with Arrow enabled, or ArrowStreamPandasSerializer for pandas UDFs with pyarrow<0.12, creates a datetime64[ns] series as intermediate data and then converts it to a datetime.date series. The intermediate datetime64[ns] value can overflow even when the date is valid, because datetime64[ns] only covers roughly 1677-09-21 through 2262-04-11:
>>> import datetime
>>>
>>> t = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
>>>
>>> df = spark.createDataFrame(t, 'date')
>>> df.show()
+----------+
|     value|
+----------+
|2262-04-12|
|2263-04-12|
+----------+
>>>
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>>
>>> df.toPandas()
        value
0  1677-09-21
1  1678-09-21
We should avoid creating such intermediate data and create datetime.date series directly instead.
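The wraparound can be reproduced without Spark or pandas. The following is a minimal sketch that only models the arithmetic, assuming the intermediate value is a signed 64-bit count of nanoseconds since 1970-01-01 (which is how datetime64[ns] stores its values). The helper names wrap_int64, via_ns_intermediate and via_days_direct are hypothetical and are not part of Spark or pyarrow.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
NS_PER_DAY = 86400 * 10**9


def wrap_int64(value):
    """Simulate two's-complement wraparound of a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


def via_ns_intermediate(d):
    """Round-trip a date through an int64 nanosecond count (models datetime64[ns])."""
    ns = wrap_int64((d - EPOCH).days * NS_PER_DAY)
    # Floor division recovers the day that the (possibly wrapped) value falls in.
    return EPOCH + datetime.timedelta(days=ns // NS_PER_DAY)


def via_days_direct(d):
    """Round-trip a date through a plain day count, with no nanosecond step."""
    days = (d - EPOCH).days
    return EPOCH + datetime.timedelta(days=days)


for d in [datetime.date(2262, 4, 11), datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]:
    print(d, "->", via_ns_intermediate(d), "|", via_days_direct(d))
# 2262-04-11 -> 2262-04-11 | 2262-04-11
# 2262-04-12 -> 1677-09-21 | 2262-04-12
# 2263-04-12 -> 1678-09-21 | 2263-04-12
```

The wrapped results match the values in the toPandas() output above, while the day-based path returns the original dates. This is why building the datetime.date series directly from the day count avoids the problem.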