Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 4.0.0, 3.5.1

Description
Because of a limitation in PyArrow, when PyArrow Tables containing MapArray columns with nested fields or timestamps are passed to spark.createDataFrame(), null values in the MapArray columns are replaced with empty lists.
The PySpark function where this happens is pyspark.sql.pandas.types._check_arrow_array_timestamps_localize.
Also see https://github.com/apache/arrow/issues/41684.
See the skipped tests and the TODO mentioning SPARK-48302.
[Update] A fix for this has been implemented in PyArrow in https://github.com/apache/arrow/pull/41757 by adding a mask argument to pa.MapArray.from_arrays; it will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will continue to support for a while) won't have this argument, we will need a check such as:

LooseVersion(pa.__version__) >= LooseVersion("17.0.0")

or

from inspect import signature
"mask" in signature(pa.MapArray.from_arrays).parameters

and only pass mask when that check is true.
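The conditional call described above could be sketched as follows. This is a minimal illustration, not the PySpark implementation: a stand-in function mimics pa.MapArray.from_arrays after apache/arrow#41757 so the snippet runs without PyArrow installed; real code would inspect pyarrow itself, and the helper names (at_least_17, has_mask) are made up here.

```python
from inspect import signature

# Stand-in for pa.MapArray.from_arrays with the new `mask` keyword
# (hypothetical; real code would import pyarrow as pa and inspect
# pa.MapArray.from_arrays directly).
def from_arrays(offsets, keys, items, mask=None):
    return mask

# Option 1: compare release versions (LooseVersion-style tuple compare).
def at_least_17(version):
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(version) >= (17, 0, 0)

# Option 2: feature-detect the `mask` keyword on the function itself.
has_mask = "mask" in signature(from_arrays).parameters

# Only pass `mask` when supported; older PyArrow versions would raise
# TypeError on the unknown keyword argument.
kwargs = {"mask": [False, True]} if has_mask else {}
result = from_arrays([0, 1, 2], ["a", "b"], [1, 2], **kwargs)
```

The signature-based check (Option 2) has the advantage of detecting the feature directly rather than inferring it from a version string.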
Issue Links
- relates to SPARK-48220 (Allow passing PyArrow Table to createDataFrame()) - Resolved