Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 0.14.1
Description
The write_to_dataset function in pyarrow/parquet.py behaves unexpectedly.
When we write a table that carries custom schema metadata, that metadata is replaced by pandas metadata. This happens only when partition_cols is specified.
The following example reproduces the issue:
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'})
print(table.schema.metadata)
"""
Metadata is set as expected
>> OrderedDict([('data', 'test')])
"""

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
"""
Metadata with the key `data` is missing
>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "columnA", "name": "columnA", "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
"""
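To make the expected behavior concrete, here is a minimal sketch in plain Python. It is not pyarrow's code and not the actual fix: merge_schema_metadata is a hypothetical helper, and the two dictionaries stand in for the table's original metadata and the metadata found on a written piece. It only illustrates the outcome this report expects, namely that the custom key survives alongside the pandas key.

```python
def merge_schema_metadata(original, regenerated):
    """Return a dict holding every key from both metadata mappings.

    `original` stands for the metadata attached to the table before
    writing; `regenerated` stands for the metadata found on a written
    piece. Keys in `regenerated` (such as 'pandas') win on conflict.
    Hypothetical helper for illustration only, not a pyarrow function.
    """
    merged = dict(original or {})
    merged.update(regenerated or {})
    return merged


# Stand-in values modeled on the report above (pandas payload shortened).
original = {'data': 'test'}
regenerated = {'pandas': '{"index_columns": []}'}

merged = merge_schema_metadata(original, regenerated)
print(sorted(merged))  # ['data', 'pandas']
```

With real tables, a merged mapping like this could probably be re-attached after reading via Table.replace_schema_metadata, though that workaround has not been tested against this report.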