Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: None
- Environment: Linux Mint 18.2, Anaconda Python distribution + pyarrow installed from the conda-forge channel
Description
I accidentally created a large number of Parquet files with two `__index_level_0__` columns (through a Spark SQL query).
PyArrow can read these files into tables, but it segfaults when converting the resulting tables to Pandas DataFrames or when saving the tables to Parquet files.
# Duplicate columns cause segmentation faults
import pyarrow.parquet as pq

table = pq.read_table('/path/to/duplicate_column_file.parquet')
table.to_pandas()  # Segmentation fault
pq.write_table(table, '/some/output.parquet')  # Segmentation fault
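For reference, the problem can be reproduced without Spark: Arrow schemas permit duplicate field names, so a table with two identically named columns can be built in memory. The sketch below is not part of the original report; the column name and output path are illustrative, and the crash only occurs on an affected pyarrow build.

import pyarrow as pa
import pyarrow.parquet as pq

# Build a two-column table whose columns share a name; Arrow schemas
# do not require field names to be unique.
a = pa.array([1, 2, 3])
b = pa.array([4, 5, 6])
table = pa.Table.from_arrays([a, b], names=['__index_level_0__', '__index_level_0__'])

# On an affected build, both calls segfault, matching the behaviour
# seen with the Spark-generated files.
table.to_pandas()
pq.write_table(table, '/tmp/duplicate_columns.parquet')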
If I remove the duplicate column using `table.remove_column(...)`, everything works without segfaults.
# After removing the duplicate column, everything works fine
import pyarrow.parquet as pq

table = pq.read_table('/path/to/duplicate_column_file.parquet')
table = table.remove_column(34)  # remove_column returns a new Table
table.to_pandas()  # OK
pq.write_table(table, '/some/output.parquet')  # OK
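The hard-coded index 34 is specific to my files. As a more general workaround, the positions of repeated column names can be located programmatically; this is a sketch that assumes only the documented Table.schema.names and Table.remove_column APIs, and it keeps the first occurrence of each name.

import pyarrow.parquet as pq

table = pq.read_table('/path/to/duplicate_column_file.parquet')

# Find every repeated occurrence of a column name (keeping the first),
# then drop them from highest index to lowest so the remaining indices
# stay valid; remove_column returns a new Table each time.
names = table.schema.names
duplicates = [i for i, name in enumerate(names) if name in names[:i]]
for i in reversed(duplicates):
    table = table.remove_column(i)

table.to_pandas()  # OK once the duplicates are gone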
For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.