PARQUET-1245

[C++] Segfault when writing Arrow table with duplicate columns


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: cpp-1.5.0
    • Component/s: None
    • Environment: Linux Mint 18.2; Anaconda Python distribution with pyarrow installed from the conda-forge channel

    Description

      I accidentally created a large number of Parquet files with two `__index_level_0__` columns (through a Spark SQL query).

      PyArrow can read these files into tables, but it segfaults when converting the resulting tables to Pandas DataFrames or when writing them back to Parquet files.

      # Duplicate columns cause segmentation faults
      import pyarrow.parquet as pq

      table = pq.read_table('/path/to/duplicate_column_file.parquet')
      table.to_pandas()  # Segmentation fault
      pq.write_table(table, '/some/output.parquet')  # Segmentation fault
      
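      One way to check whether a file is affected without triggering the segfault is to inspect only the schema. This is a sketch, not part of the original report; it assumes pyarrow's pq.read_schema helper, which reads the file metadata without materializing the table.

      # Detect duplicate column names from the file metadata alone
      import collections

      import pyarrow.parquet as pq

      schema = pq.read_schema('/path/to/duplicate_column_file.parquet')
      duplicates = [name for name, count in collections.Counter(schema.names).items()
                    if count > 1]
      print(duplicates)  # e.g. ['__index_level_0__']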

      If I remove the duplicate column using table.remove_column(...), everything works without segfaults.

      # After removing the duplicate column, everything works fine
      import pyarrow.parquet as pq

      table = pq.read_table('/path/to/duplicate_column_file.parquet')
      table = table.remove_column(34)  # remove_column returns a new table
      table.to_pandas()  # OK
      pq.write_table(table, '/some/output.parquet')  # OK
      
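      The hard-coded index (34) is specific to my files. A more general workaround might drop every later occurrence of a duplicated name; the following is a sketch under the assumption that Table.column_names and Table.remove_column behave as in current pyarrow (remove_column returns a new table rather than mutating in place).

      # Drop duplicate columns by name, keeping the first occurrence
      import pyarrow.parquet as pq

      table = pq.read_table('/path/to/duplicate_column_file.parquet')

      seen = set()
      dup_indices = []
      for i, name in enumerate(table.column_names):
          if name in seen:
              dup_indices.append(i)  # later occurrence of a duplicated name
          else:
              seen.add(name)

      # Remove from the highest index down so earlier positions stay valid
      for i in reversed(dup_indices):
          table = table.remove_column(i)

      table.to_pandas()  # OK once the duplicates are gone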

      For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.


    People

      Assignee: Antoine Pitrou (apitrou)
      Reporter: Alexey Strokach (ostrokach)
      Votes: 0
      Watchers: 4
