Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 4.0.0, 3.5.1

Description
Because of a limitation in PyArrow, when PyArrow Tables containing MapArray columns with nested fields or timestamps are passed to spark.createDataFrame(), null values in the MapArray columns are replaced with empty lists.
The PySpark function where this happens is pyspark.sql.pandas.types._check_arrow_array_timestamps_localize.
Also see https://github.com/apache/arrow/issues/41684.
See the skipped tests and the TODO mentioning SPARK-48302.
[Update] A fix for this has been implemented in PyArrow in https://github.com/apache/arrow/pull/41757 by adding a mask argument to pa.MapArray.from_arrays; it will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will continue to support for a while) won't have this argument, we will need a check such as:

LooseVersion(pa.__version__) >= LooseVersion("17.0.0")

or

from inspect import signature
"mask" in signature(pa.MapArray.from_arrays).parameters

and only pass mask when that check is true.
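The conditional call described above could be sketched as follows. This is a minimal illustration, not the PySpark implementation: a stand-in function mimics pa.MapArray.from_arrays after apache/arrow#41757 so the snippet runs without PyArrow installed; real code would inspect pyarrow itself, and the helper names (at_least_17, has_mask) are made up here.

```python
from inspect import signature

# Stand-in for pa.MapArray.from_arrays with the new `mask` keyword
# (hypothetical; real code would import pyarrow as pa and inspect
# pa.MapArray.from_arrays directly).
def from_arrays(offsets, keys, items, mask=None):
    return mask

# Option 1: compare release versions (LooseVersion-style tuple compare).
def at_least_17(version):
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(version) >= (17, 0, 0)

# Option 2: feature-detect the `mask` keyword on the function itself.
has_mask = "mask" in signature(from_arrays).parameters

# Only pass `mask` when supported; older PyArrow versions would raise
# TypeError on the unknown keyword argument.
kwargs = {"mask": [False, True]} if has_mask else {}
result = from_arrays([0, 1, 2], ["a", "b"], [1, 2], **kwargs)
```

The signature-based check (Option 2) has the advantage of detecting the feature directly rather than inferring it from a version string.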
Issue Links
- relates to SPARK-48220 (Allow passing PyArrow Table to createDataFrame()) - Resolved