Spark › SPARK-44111 Prepare Apache Spark 4.0.0 › SPARK-48302

Preserve nulls in map columns in PyArrow Tables


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0.0, 3.5.1
    • Fix Version/s: 4.0.0
    • Component/s: PySpark

    Description

      Because of a limitation in PyArrow, when PyArrow Tables containing MapArray columns with nested fields or timestamps are passed to spark.createDataFrame(), null values in the MapArray columns are replaced with empty lists.

      The PySpark function where this happens is pyspark.sql.pandas.types._check_arrow_array_timestamps_localize.

      Also see https://github.com/apache/arrow/issues/41684.

      See the skipped tests and the TODO mentioning SPARK-48302.

      [Update] A fix for this has been implemented in PyArrow in https://github.com/apache/arrow/pull/41757 by adding a mask argument to pa.MapArray.from_arrays. This will be released in PyArrow 17.0.0. Since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need to do a check like:

      LooseVersion(pa.__version__) >= LooseVersion("17.0.0")

      or

      from inspect import signature
      "mask" in signature(pa.MapArray.from_arrays).parameters

      and only pass the mask argument when that check succeeds.
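A minimal sketch of the signature-based check described above, using a hypothetical stand-in function (in PySpark the real check would inspect pa.MapArray.from_arrays itself; the stand-in names here are for illustration only):

```python
from inspect import signature

# Hypothetical stand-in for pa.MapArray.from_arrays on PyArrow >= 17.0.0,
# where a `mask` keyword was added (apache/arrow PR 41757).
def from_arrays_new(offsets, keys, items, mask=None):
    pass

# Hypothetical stand-in for the same function on older PyArrow versions,
# which have no `mask` parameter.
def from_arrays_old(offsets, keys, items):
    pass

def supports_mask(func):
    # Feature-detect the `mask` parameter rather than comparing versions.
    return "mask" in signature(func).parameters

def mask_kwargs(func, mask):
    # Build the keyword arguments conditionally, so the same call site
    # works whether or not the installed PyArrow accepts `mask`.
    return {"mask": mask} if supports_mask(func) else {}
```

With this, the caller can write `func(offsets, keys, items, **mask_kwargs(func, mask))` and nulls are preserved on PyArrow 17.0.0+ while older versions simply fall back to the current behavior.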


People

    Assignee: Ian Cook (icook)
    Reporter: Ian Cook (icook)
    Votes: 0
    Watchers: 2
