Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Labels: None
Description
When reading a Parquet file (which was written by Arrow) with the datasets API, the resulting schema preserves the "ARROW:schema" key in its metadata:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")
dataset = ds.dataset("test.parquet", format="parquet")
In [7]: dataset.schema
Out[7]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114

In [8]: dataset.to_table().schema
Out[8]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
whereas when reading with the `parquet` module reader, this metadata is not preserved:
In [9]: pq.read_table("test.parquet").schema
Out[9]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
Since the "ARROW:schema" information is used to properly reconstruct the Arrow schema from the ParquetSchema, it is no longer needed once you already have the arrow schema, so it's probably not needed to keep it as metadata in the arrow schema.
Attachments
Issue Links
- is related to: ARROW-8980 [Python] Metadata grows exponentially when using schema from disk (Resolved)