[ARROW-7087] [Python] Table metadata disappears when we write a partitioned dataset - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.14.1
Fix Version/s: 0.16.0
Component/s: Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/23394

Description

The write_to_dataset function in pyarrow/parquet.py behaves unexpectedly.

When we write a table that carries its own schema metadata, that metadata is replaced by pandas metadata. This happens only when partition_cols is set.

The following example reproduces the problem:

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.parquet import write_to_dataset

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns, attaching custom schema metadata
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'})
print(table.schema.metadata)
"""
Metadata is set as expected

>> OrderedDict([('data', 'test')])
"""

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from the first parquet file of the dataset
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
"""
Metadata with the key `data` is missing

>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "columnA", "name": "columnA", "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
"""

Issue Links

links to

GitHub Pull Request #6127

People

Assignee: François Blanchard
Reporter: François Blanchard
Votes: 2
Watchers: 4

Dates

Created: 07/Nov/19 15:51
Updated: 11/Jan/23 07:51
Resolved: 07/Jan/20 13:45

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 40m