Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Python example:
import pandas as pd
import pyarrow.dataset as ds

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_parquet("test_metadata.parquet")
dataset = ds.dataset("test_metadata.parquet")
gives:
>>> dataset.to_table().schema
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
ARROW:schema: '/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 806
>>> dataset.to_table(columns=['a']).schema
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
So when a subset of the columns is specified, the schema-level metadata entries (here `pandas` and `ARROW:schema`) are lost, even though they can still be informative, e.g. for the conversion to pandas. The field-level metadata is kept.