Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.2.0
- Fix Version/s: None
Description
Converting a PySpark dataframe with nested columns to Pandas (with `toPandas()`) does not map the nested columns to their Pandas equivalent, i.e. columns indexed by a MultiIndex.
For example, a dataframe with the following structure:
```
>>> df.printSchema()
root
 |-- device_ID: string (nullable = true)
 |-- time_origin_UTC: timestamp (nullable = true)
 |-- duration_s: integer (nullable = true)
 |-- session_time_UTC: timestamp (nullable = true)
 |-- probes_by_AP: struct (nullable = true)
 |    |-- aa:bb:cc:dd:ee:ff: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- delay_s: float (nullable = true)
 |    |    |    |-- RSSI: short (nullable = true)
 |-- max_RSSI_info_by_AP: struct (nullable = true)
 |    |-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
 |    |    |-- delay_s: float (nullable = true)
 |    |    |-- RSSI: short (nullable = true)
```
yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is not nested on the Pandas side (i.e. not exposed through a MultiIndex):
```
>>> df_pandas_version = df.toPandas()
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')
>>> df_pandas_version["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))
>>> type(_)  # PySpark type, instead of Pandas!
pyspark.sql.types.Row
```
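For reference, here is a minimal pandas-only sketch of what the desired layout could look like (values hard-coded from the session above; `df_desired` is a hypothetical name, not anything Spark produces):

```python
import pandas as pd

# Hypothetical target: the same data, with the nesting mapped to a
# column MultiIndex (values taken from the Row shown above).
columns = pd.MultiIndex.from_tuples([
    ("max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "delay_s"),
    ("max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"),
])
df_desired = pd.DataFrame([[0.0, 6]], columns=columns)

# The lookup that raised KeyError above now succeeds:
df_desired["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]
# 0    6
# Name: (max_RSSI_info_by_AP, aa:bb:cc:dd:ee:ff, RSSI), dtype: int64
```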
It would be much more convenient if `toPandas()` performed the conversion this thoroughly itself, mapping nested struct columns to a Pandas MultiIndex.
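In the meantime, a possible user-side workaround is to flatten struct columns with `select()` before calling `toPandas()` and rebuild the nesting as a MultiIndex. This is only a sketch under stated assumptions (the helper name `flatten_to_multiindex` is hypothetical, and array columns such as `probes_by_AP`'s elements are deliberately left untouched):

```python
import pandas as pd
from pyspark.sql.types import StructType

def flatten_to_multiindex(df):
    """Flatten struct columns of a PySpark dataframe and return a pandas
    DataFrame whose columns form a MultiIndex mirroring the nesting.
    Sketch only: arrays of structs are kept as-is, not expanded."""
    def leaf_paths(schema, prefix=()):
        for field in schema.fields:
            path = prefix + (field.name,)
            if isinstance(field.dataType, StructType):
                yield from leaf_paths(field.dataType, path)
            else:
                yield path

    paths = list(leaf_paths(df.schema))
    # Build one Column per leaf with getField(), which copes with field
    # names containing colons such as "aa:bb:cc:dd:ee:ff".
    cols = []
    for path in paths:
        col = df[path[0]]
        for name in path[1:]:
            col = col.getField(name)
        cols.append(col.alias("/".join(path)))  # unique temporary names

    pdf = df.select(cols).toPandas()
    # Pad shorter paths so all tuples share the same depth, then rebuild
    # the nesting as a column MultiIndex on the pandas side.
    depth = max(len(p) for p in paths)
    pdf.columns = pd.MultiIndex.from_tuples(
        [p + ("",) * (depth - len(p)) for p in paths]
    )
    return pdf

# After which the lookup from the description should work:
# >>> flatten_to_multiindex(df)["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]
```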