PARQUET-1245

[C++] Segfault when writing Arrow table with duplicate columns


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: cpp-1.5.0
    • Component/s: None
    • Environment: Linux Mint 18.2; Anaconda Python distribution with pyarrow installed from the conda-forge channel

    Description

      I accidentally created a large number of Parquet files with two `__index_level_0__` columns (through a Spark SQL query).

      PyArrow can read these files into tables, but it segfaults when converting the resulting tables to Pandas DataFrames or when writing them back to Parquet files.

      # Duplicate columns cause segmentation faults
      import pyarrow.parquet as pq

      table = pq.read_table('/path/to/duplicate_column_file.parquet')
      table.to_pandas()  # Segmentation fault
      pq.write_table(table, '/some/output.parquet')  # Segmentation fault
      
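      One way to check whether a file is affected without triggering the segfault is to inspect only the schema. This is a sketch, not part of the original report; it assumes pyarrow's pq.read_schema helper, which reads the file metadata without materializing the table.

      # Detect duplicate column names from the file metadata alone
      import collections

      import pyarrow.parquet as pq

      schema = pq.read_schema('/path/to/duplicate_column_file.parquet')
      duplicates = [name for name, count in collections.Counter(schema.names).items()
                    if count > 1]
      print(duplicates)  # e.g. ['__index_level_0__']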

      If I remove the duplicate column using table.remove_column(...), everything works without segfaults.

      # After removing the duplicate column, everything works fine
      import pyarrow.parquet as pq

      table = pq.read_table('/path/to/duplicate_column_file.parquet')
      table = table.remove_column(34)  # remove_column returns a new table
      table.to_pandas()  # OK
      pq.write_table(table, '/some/output.parquet')  # OK
      
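      The hard-coded index (34) is specific to my files. A more general workaround might drop every later occurrence of a duplicated name; the following is a sketch under the assumption that Table.column_names and Table.remove_column behave as in current pyarrow (remove_column returns a new table rather than mutating in place).

      # Drop duplicate columns by name, keeping the first occurrence
      import pyarrow.parquet as pq

      table = pq.read_table('/path/to/duplicate_column_file.parquet')

      seen = set()
      dup_indices = []
      for i, name in enumerate(table.column_names):
          if name in seen:
              dup_indices.append(i)  # later occurrence of a duplicated name
          else:
              seen.add(name)

      # Remove from the highest index down so earlier positions stay valid
      for i in reversed(dup_indices):
          table = table.remove_column(i)

      table.to_pandas()  # OK once the duplicates are gone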

      For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.


    People

      Assignee: Antoine Pitrou (apitrou)
      Reporter: Alexey Strokach (ostrokach)
      Votes: 0
      Watchers: 4
