Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
When using the standard dataset factory (ds.dataset), the pandas metadata is included in the schema metadata of the resulting dataset, but when using the ParquetDatasetFactory (ds.parquet_dataset), it is not:
Using dask to write a small partitioned dataset that also writes a _metadata file:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"part": ["A", "A", "B", "B"], "col": [1, 2, 3, 4]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.to_parquet("test_parquet_pandas_metadata/", engine="pyarrow")
In [9]: import pyarrow.dataset as ds

# with ds.dataset -> pandas metadata included
In [11]: ds.dataset("test_parquet_pandas_metadata/", format="parquet", partitioning="hive").schema
Out[11]:
part: string
  -- field metadata --
  PARQUET:field_id: '1'
col: int64
  -- field metadata --
  PARQUET:field_id: '2'
index: int64
  -- field metadata --
  PARQUET:field_id: '3'
-- schema metadata --
pandas: '{"index_columns": ["index"], "column_indexes": [{"name": null, "' + 558

# with parquet_dataset -> pandas metadata not included
In [14]: ds.parquet_dataset("test_parquet_pandas_metadata/_metadata", partitioning="hive").schema
Out[14]:
part: string
  -- field metadata --
  PARQUET:field_id: '1'
col: int64
  -- field metadata --
  PARQUET:field_id: '2'
index: int64
  -- field metadata --
  PARQUET:field_id: '3'

# to show that the pandas metadata is present in the actual Parquet FileMetaData
In [17]: pq.read_metadata("test_parquet_pandas_metadata/_metadata").metadata
Out[17]:
{b'ARROW:schema': b'/////4ADAAAQAAAAAAAKAA4AB...',
 b'pandas': b'{"index_columns": ["index"], ...'}
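Until this is fixed, a minimal workaround sketch (not part of the original report; it relies on the documented schema argument of ds.parquet_dataset and on Schema.with_metadata) is to read the pandas metadata directly from the _metadata file and attach it to the schema manually:

import pyarrow.dataset as ds
import pyarrow.parquet as pq

metadata_path = "test_parquet_pandas_metadata/_metadata"

# Build the dataset once to get the reconciled schema (without pandas metadata).
dataset = ds.parquet_dataset(metadata_path, partitioning="hive")

# The key/value metadata (including b'pandas') is present in the Parquet
# FileMetaData, it just isn't propagated to the dataset schema.
file_metadata = pq.read_metadata(metadata_path).metadata

# Attach the pandas entry to the schema and rebuild the dataset with it.
patched_schema = dataset.schema.with_metadata({b"pandas": file_metadata[b"pandas"]})
dataset = ds.parquet_dataset(metadata_path, schema=patched_schema, partitioning="hive")

# dataset.to_table().to_pandas() should now restore the original index.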