Spark / SPARK-31441

Support duplicated column names for toPandas with Arrow execution.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.5, 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      When toPandas() runs with Arrow execution enabled, it fails if the DataFrame contains duplicate column names.

      >>> spark.sql("select 1 v, 1 v").toPandas()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
          pdf = table.to_pandas()
        File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
        File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
        File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.7/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
          columns = _deserialize_column_index(table, all_columns, column_indexes)
        File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
          columns = _flatten_single_level_multiindex(columns)
        File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
          raise ValueError('Found non-unique column index')
      ValueError: Found non-unique column index
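
The traceback shows the error being raised inside pyarrow's table-to-pandas conversion, not in Spark itself. One possible way around it, sketched below, is to give the Arrow table unique positional column names before converting and then restore the original names on the resulting pandas DataFrame, since pandas accepts duplicate column labels. The helper names (`positional_names`, `has_duplicates`, `arrow_table_to_pandas`) are hypothetical and used only for illustration; this is not necessarily how the issue was fixed in Spark, and the behavior of `to_pandas()` with duplicate names may vary by pyarrow version.

```python
def positional_names(columns):
    """Return unique placeholder names, one per column position."""
    return ["col_{}".format(i) for i in range(len(columns))]


def has_duplicates(columns):
    """Return True if any column name appears more than once."""
    return len(set(columns)) != len(columns)


def arrow_table_to_pandas(table):
    """Convert a pyarrow.Table to pandas even when column names repeat.

    Hypothetical helper for illustration; assumes pyarrow and pandas
    are installed and that `table` is a pyarrow.Table.
    """
    original = list(table.column_names)
    if has_duplicates(original):
        # Replace every name with a unique positional placeholder so
        # that pyarrow's conversion sees a unique column index.
        table = table.rename_columns(positional_names(original))
    pdf = table.to_pandas()
    # Restore the original (possibly duplicated) column names.
    pdf.columns = original
    return pdf
```

Given a `pyarrow.Table` whose columns are both named `v`, `arrow_table_to_pandas(table)` would convert it under the placeholder names and return a pandas DataFrame whose columns are again `v` and `v`.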
      

            People

            • Assignee:
              ueshin Takuya Ueshin
            • Reporter:
              ueshin Takuya Ueshin
            • Votes:
              0
            • Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved: