Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
Description
Currently pyarrow's parquet writer only writes `_common_metadata` and not `_metadata`. From what I understand, this file is intended to contain the dataset schema but not any row group information.
A few (possibly naive) questions:
1. In the `__init__` for `ParquetDataset`, the following lines exist:

```python
if self.metadata_path is not None:
    with self.fs.open(self.metadata_path) as f:
        self.common_metadata = ParquetFile(f).metadata
else:
    self.common_metadata = None
```
I believe this should use `common_metadata_path` instead of `metadata_path`: the latter points to the `_metadata` file, which `pyarrow` never writes, while `_common_metadata` is what this code seemingly intends to read.
2. In `validate_schemas` I believe an option should exist for using the schema from `_common_metadata` instead of `_metadata`, as pyarrow currently only writes the former, and as far as I can tell `_common_metadata` does include all the schema information needed.
Perhaps the logic in `validate_schemas` could be changed to:

```python
if self.schema is not None:
    pass  # schema explicitly provided
elif self.metadata is not None:
    self.schema = self.metadata.schema
elif self.common_metadata is not None:
    self.schema = self.common_metadata.schema
else:
    self.schema = self.pieces[0].get_metadata(open_file).schema
```
If these changes are valid, I'd be happy to submit a PR. The exact difference between `_common_metadata` and `_metadata` isn't 100% clear to me, but I believe the schema in both should be the same. Figured I'd open this for discussion.
Issue Links
- is related to
  - ARROW-2209 [Python] Partition columns are not correctly loaded in schema of ParquetDataset (Resolved)
  - ARROW-8062 [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file (Resolved)
- relates to
  - ARROW-8446 [Python][Dataset] Detect and use _metadata file in a list of file paths (Open)