  1. Apache Arrow
  2. ARROW-1132

[Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4.1
    • Fix Version/s: 0.5.0
    • Component/s: Python
    • Labels:
      None
    • Environment:
      OS X, miniconda, using pyarrow build from conda-forge

      Description

      pandas DataFrames that have a `MultiIndex` always seem to convert to a `Table` just fine. However, when writing the `Table` to disk using `pyarrow.parquet`, I am unable to write DataFrames whose `MultiIndex` contains a level with duplicate values (which is nearly always the case for me). Here is an example in Python with working cases and a failure case at the bottom:

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      num_rows = 3
      example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                              'nums_b': range(num_rows),
                              'nums_a': range(num_rows)})
      
      
      def pq_write(df):
          table = pa.Table.from_pandas(df)
          pq.write_table(table, '/tmp/df.parquet')
      
      # single index works
      pq_write(example)
      pq_write(example.set_index(['nums_b']))
      
      # single index with duplicate values works
      pq_write(example.set_index(['strs']))
      
      # MultiIndex in which every level contains only unique values works
      pq_write(example.set_index(['nums_b', 'nums_a']))
      
      # MultiIndex in which one level contains duplicate values FAILS
      pq_write(example.set_index(['strs', 'nums_a']))
      
      Traceback (most recent call last):
        File "test_arrow.py", line 26, in <module>
          pq_write(example.set_index(['strs', 'nums_a']))
        File "test_arrow.py", line 13, in pq_write
          pq.write_table(table, '/tmp/df.parquet')
        File "/Users/bmabey/anaconda/envs/test_pyarrow/lib/python3.5/site-packages/pyarrow/parquet.py", line 702, in write_table
          writer.write_table(table, row_group_size=row_group_size)
        File "pyarrow/_parquet.pyx", line 609, in pyarrow._parquet.ParquetWriter.write_table (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/_parquet.cxx:11025)
        File "pyarrow/error.pxi", line 60, in pyarrow.lib.check_status (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/lib.cxx:6899)
      pyarrow.lib.ArrowIOError: IOError: Written rows: 2 != expected rows: 3in the current column chunk
      

      Note that the number of written rows equals the number of unique values in the `strs` level. I have found this to always be the case when I've hit this error message.
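
The mismatch between 2 and 3 can be reproduced with pandas alone. The sketch below is only an illustration of the observation above, not a confirmed diagnosis of the pyarrow code path: it assumes the relevant difference is between the distinct values a `MultiIndex` stores per level (`levels`) and the per-row values returned by `get_level_values`. The variable names are made up for this example.

```python
import pandas as pd

# Same data as the failing case in the report.
example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                        'nums_b': range(3),
                        'nums_a': range(3)})
index = example.set_index(['strs', 'nums_a']).index

# .levels holds only the distinct values of each level.
unique_strs = index.levels[0]
# get_level_values expands a level to one entry per row.
per_row_strs = index.get_level_values('strs')

print(len(unique_strs), len(per_row_strs))
```

If a writer used the distinct values of a level as if they were a full column, it would produce fewer rows than the table has, which would be consistent with the "Written rows: 2 != expected rows: 3" message.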

      I'm happy to write a patch for this assuming this is a bug and you can point me in the right direction.

            People

            • Assignee:
              cpcloud Phillip Cloud
              Reporter:
              bmabey Ben Mabey