  1. Spark
  2. SPARK-21537

toPandas() should handle nested columns (as a Pandas MultiIndex)


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      The conversion of a PySpark dataframe with nested columns to Pandas (with `toPandas()`) does not convert nested columns into their Pandas equivalent, i.e. columns indexed by a MultiIndex.

      For example, a dataframe with the following structure:

      >>> df.printSchema()
      root
       |-- device_ID: string (nullable = true)
       |-- time_origin_UTC: timestamp (nullable = true)
       |-- duration_s: integer (nullable = true)
       |-- session_time_UTC: timestamp (nullable = true)
       |-- probes_by_AP: struct (nullable = true)
       |    |-- aa:bb:cc:dd:ee:ff: array (nullable = true)
       |    |    |-- element: struct (containsNull = true)
       |    |    |    |-- delay_s: float (nullable = true)
       |    |    |    |-- RSSI: short (nullable = true)
       |-- max_RSSI_info_by_AP: struct (nullable = true)
       |    |-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
       |    |    |-- delay_s: float (nullable = true)
       |    |    |-- RSSI: short (nullable = true)
      

      yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is not nested inside Pandas (through a MultiIndex):

      >>> df_pandas_version = df.toPandas()
      >>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # Should work!
      (…)
      KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')
      >>> df_pandas_version["max_RSSI_info_by_AP"].iloc[0]
      Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))
      >>> type(_)  # PySpark type, instead of Pandas!
      pyspark.sql.types.Row
      

      It would be much more convenient if `toPandas()` converted nested struct columns into Pandas columns indexed by a MultiIndex, instead of leaving them as PySpark `Row` objects.
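
To illustrate the requested behavior, here is a minimal sketch of how nested records could be flattened into MultiIndex columns on the Pandas side. It is not how `toPandas()` works and not a proposed patch. The helper names (`flatten_record`, `pad_paths`, `to_multiindex_frame`) are hypothetical, and the sketch assumes the input rows are plain nested dicts, such as those returned by `Row.asDict(recursive=True)`. Arrays of structs (like `probes_by_AP` above) are left as ordinary list values.

```python
from collections.abc import Mapping


def flatten_record(record, prefix=()):
    """Flatten nested mappings into {path_tuple: leaf_value}."""
    flat = {}
    for key, value in record.items():
        path = prefix + (key,)
        if isinstance(value, Mapping) and value:
            # Recurse into struct-like values, extending the path.
            flat.update(flatten_record(value, path))
        else:
            # Scalars, lists and empty mappings are kept as leaves.
            flat[path] = value
    return flat


def pad_paths(flat, depth, fill=""):
    """Pad every path to the same length, as MultiIndex levels require."""
    return {
        path + (fill,) * (depth - len(path)): value
        for path, value in flat.items()
    }


def to_multiindex_frame(records):
    """Build a Pandas dataframe whose columns are a MultiIndex (hypothetical helper)."""
    import pandas as pd  # imported lazily; the helpers above need only the stdlib

    flat_rows = [flatten_record(record) for record in records]
    if not flat_rows:
        raise ValueError("at least one record is needed to infer the columns")
    depth = max(len(path) for row in flat_rows for path in row)
    padded = [pad_paths(row, depth) for row in flat_rows]
    # Preserve the first-seen column order across all rows.
    columns = list(dict.fromkeys(path for row in padded for path in row))
    data = [[row.get(column) for column in columns] for row in padded]
    return pd.DataFrame(data, columns=pd.MultiIndex.from_tuples(columns))
```

With rows shaped like the schema above, `to_multiindex_frame(rows)["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]` should then return a Pandas Series instead of raising a `KeyError`. Top-level columns such as `device_ID` are padded with empty strings on the lower levels, which is one possible convention among several.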

          People

            Assignee: Unassigned
            Reporter: Eric O. LEBIGOT (EOL) (lebigot)
            Votes: 0
            Watchers: 4