Apache Arrow / ARROW-7087

[Python] Table metadata disappears when we write a partitioned dataset


Details

    Description

      The method write_to_dataset in pyarrow/parquet.py behaves unexpectedly.

      When we write a table that carries schema metadata, that metadata is replaced by the pandas metadata. This happens only if partition_cols is defined.

      To be more explicit, here is example code:

      from pyarrow.parquet import write_to_dataset
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      columnA = pa.array(['a', 'b', 'c'], type=pa.string())
      columnB = pa.array([1, 1, 2], type=pa.int32())
      
      # Build table from columns, attaching custom schema metadata
      table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'})
      print(table.schema.metadata)
      """
      Metadata is set as expected
      
      >> OrderedDict([('data', 'test')])
      """
      
      # Write table in parquet format partitioned per columnB
      write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
      
      # Load data from parquet files
      ds = pq.ParquetDataset('/path/to/test')
      load_table = pq.read_table(ds.pieces[0].path)
      print(load_table.schema.metadata)
      """
      Metadata with the key `data` is missing
      
      
      >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "columnA", "name": "columnA", "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
      """
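
Until the writer preserves the metadata itself, one possible reader-side workaround is to re-attach the custom keys after loading a partition file. The sketch below is only an illustration and not a fix for write_to_dataset: the helper names merge_schema_metadata and restore_metadata are hypothetical, and it assumes the original metadata is still available to the caller (for example, captured from table.schema.metadata before writing). It relies on Table.replace_schema_metadata, which returns a new table.

```python
def merge_schema_metadata(original, loaded):
    """Return the loaded metadata plus any keys from `original` that are missing.

    Both arguments are plain dicts (or None), as returned by `schema.metadata`.
    """
    merged = dict(loaded or {})
    for key, value in (original or {}).items():
        # Keep whatever the file already carries (e.g. the 'pandas' entry)
        # and only add the keys that were lost on write.
        merged.setdefault(key, value)
    return merged


def restore_metadata(loaded_table, original_metadata):
    """Re-attach custom metadata to a table read back from a partition file.

    `loaded_table` is expected to behave like a pyarrow.Table: it exposes
    `schema.metadata` and `replace_schema_metadata(dict)`.
    """
    merged = merge_schema_metadata(original_metadata, loaded_table.schema.metadata)
    # Arrow tables are immutable, so this returns a new table object.
    return loaded_table.replace_schema_metadata(merged)
```

With the example above, this would be called as restore_metadata(load_table, table.schema.metadata). Keys present in both dicts keep the value read from the file, so the pandas entry written by pyarrow is left untouched.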


            People

              TinyFrancois François Blanchard