Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.2.0
- Fix Version/s: None
Description
Converting a PySpark dataframe with nested columns to Pandas (with `toPandas()`) does not map the nested columns to their Pandas equivalent, i.e. columns indexed by a MultiIndex.
For example, a dataframe with the following structure:
```
>>> df.printSchema()
root
 |-- device_ID: string (nullable = true)
 |-- time_origin_UTC: timestamp (nullable = true)
 |-- duration_s: integer (nullable = true)
 |-- session_time_UTC: timestamp (nullable = true)
 |-- probes_by_AP: struct (nullable = true)
 |    |-- aa:bb:cc:dd:ee:ff: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- delay_s: float (nullable = true)
 |    |    |    |-- RSSI: short (nullable = true)
 |-- max_RSSI_info_by_AP: struct (nullable = true)
 |    |-- aa:bb:cc:dd:ee:ff: struct (nullable = true)
 |    |    |-- delay_s: float (nullable = true)
 |    |    |-- RSSI: short (nullable = true)
```
yields a Pandas dataframe where the `max_RSSI_info_by_AP` column is not nested on the Pandas side (i.e. not exposed through a MultiIndex):
```
>>> df_pandas_version = df.toPandas()
>>> df_pandas_version["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]  # Should work!
(…)
KeyError: ('max_RSSI_info_by_AP', 'aa:bb:cc:dd:ee:ff', 'RSSI')
>>> df_pandas_version["max_RSSI_info_by_AP"].iloc[0]
Row(aa:bb:cc:dd:ee:ff=Row(delay_s=0.0, RSSI=6))
>>> type(_)  # PySpark type, instead of Pandas!
pyspark.sql.types.Row
```
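For reference, here is a minimal pandas-only sketch of what the desired layout could look like (values hard-coded from the session above; `df_desired` is a hypothetical name, not anything Spark produces):

```python
import pandas as pd

# Hypothetical target: the same data, with the nesting mapped to a
# column MultiIndex (values taken from the Row shown above).
columns = pd.MultiIndex.from_tuples([
    ("max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "delay_s"),
    ("max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"),
])
df_desired = pd.DataFrame([[0.0, 6]], columns=columns)

# The lookup that raised KeyError above now succeeds:
df_desired["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]
# 0    6
# Name: (max_RSSI_info_by_AP, aa:bb:cc:dd:ee:ff, RSSI), dtype: int64
```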
It would be much more convenient if `toPandas()` performed the conversion this thoroughly itself, mapping nested struct columns to a Pandas MultiIndex.
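In the meantime, a possible user-side workaround is to flatten struct columns with `select()` before calling `toPandas()` and rebuild the nesting as a MultiIndex. This is only a sketch under stated assumptions (the helper name `flatten_to_multiindex` is hypothetical, and array columns such as `probes_by_AP`'s elements are deliberately left untouched):

```python
import pandas as pd
from pyspark.sql.types import StructType

def flatten_to_multiindex(df):
    """Flatten struct columns of a PySpark dataframe and return a pandas
    DataFrame whose columns form a MultiIndex mirroring the nesting.
    Sketch only: arrays of structs are kept as-is, not expanded."""
    def leaf_paths(schema, prefix=()):
        for field in schema.fields:
            path = prefix + (field.name,)
            if isinstance(field.dataType, StructType):
                yield from leaf_paths(field.dataType, path)
            else:
                yield path

    paths = list(leaf_paths(df.schema))
    # Build one Column per leaf with getField(), which copes with field
    # names containing colons such as "aa:bb:cc:dd:ee:ff".
    cols = []
    for path in paths:
        col = df[path[0]]
        for name in path[1:]:
            col = col.getField(name)
        cols.append(col.alias("/".join(path)))  # unique temporary names

    pdf = df.select(cols).toPandas()
    # Pad shorter paths so all tuples share the same depth, then rebuild
    # the nesting as a column MultiIndex on the pandas side.
    depth = max(len(p) for p in paths)
    pdf.columns = pd.MultiIndex.from_tuples(
        [p + ("",) * (depth - len(p)) for p in paths]
    )
    return pdf

# After which the lookup from the description should work:
# >>> flatten_to_multiindex(df)["max_RSSI_info_by_AP", "aa:bb:cc:dd:ee:ff", "RSSI"]
```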