Apache Arrow / ARROW-7782

[Python] Losing index information when using write_to_dataset with partition_cols


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Labels: None
    • Environment: pyarrow==0.15.1

    Description

      The index is not saved when pyarrow.parquet.write_to_dataset() is called with a partition_cols argument. The following minimal example reproduces the issue:

       
      from pathlib import Path

      import pandas as pd
      from pyarrow import Table
      from pyarrow.parquet import write_to_dataset, read_table

      path = Path('/home/user/trials')
      file_name = 'local_database.parquet'

      # DataFrame with a named string index that should survive the round trip
      df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
                        index=pd.Index(['a', 'b', 'c'], name='idx'))

      # Write the table as a dataset partitioned on column B
      table = Table.from_pandas(df)
      write_to_dataset(table, str(path / file_name), partition_cols=['B'])

      # Read the dataset back; the 'idx' index information is lost here
      df_read = read_table(str(path / file_name))
      df_read.to_pandas()
      

       

      The issue is rather important for pandas and dask users.

          People

            Assignee: Joris Van den Bossche (jorisvandenbossche)
            Reporter: Ludwik Bielczynski (LudwikB)
            Votes: 0
            Watchers: 4
