  1. Apache Arrow
  2. ARROW-2079

[Python] Possibly use `_common_metadata` for schema if `_metadata` isn't available


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python
    • Labels:

      Description

      Currently pyarrow's Parquet writer only writes `_common_metadata` and not `_metadata`. From what I understand, `_common_metadata` is intended to contain the dataset schema but no row group information.

       

      A few (possibly naive) questions:

       

      1. In the `__init__` for `ParquetDataset`, the following lines exist:

      if self.metadata_path is not None:
          with self.fs.open(self.metadata_path) as f:
              self.common_metadata = ParquetFile(f).metadata
      else:
          self.common_metadata = None
      

      I believe this should use `common_metadata_path` instead of `metadata_path`: the latter points to the `_metadata` file, which `pyarrow` never writes, rather than to `_common_metadata` (as seemingly intended?).
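As a rough illustration of point 1, the sketch below reads each sidecar file from its own path. It is not pyarrow code: `load_dataset_metadata` and `read_footer` are hypothetical names, and `read_footer` stands in for opening the file and taking `ParquetFile(f).metadata`.

```python
from types import SimpleNamespace


def load_dataset_metadata(metadata_path, common_metadata_path, read_footer):
    """Hypothetical helper: load each sidecar footer from its own path.

    `read_footer` is an assumed callable standing in for
    `ParquetFile(f).metadata`; it takes a path and returns a footer object.
    """
    # Footer of `_metadata`, or None when that file is absent.
    metadata = read_footer(metadata_path) if metadata_path is not None else None
    # Footer of `_common_metadata`, read from `common_metadata_path`
    # rather than from `metadata_path`.
    common_metadata = (
        read_footer(common_metadata_path)
        if common_metadata_path is not None
        else None
    )
    return metadata, common_metadata


# Stand-in reader so the sketch runs without any Parquet files on disk.
fake_footers = {"ds/_common_metadata": SimpleNamespace(schema="schema-A")}
metadata, common_metadata = load_dataset_metadata(
    None, "ds/_common_metadata", fake_footers.__getitem__
)
```

With only `_common_metadata` present (the case pyarrow currently produces), `metadata` stays `None` while `common_metadata` is populated.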

       

      2. In `validate_schemas` I believe an option should exist for using the schema from `_common_metadata` instead of `_metadata`, as pyarrow currently only writes the former, and as far as I can tell `_common_metadata` does include all the schema information needed.

       

      Perhaps the logic in `validate_schemas` could be changed to something like:

       

      if self.schema is not None:
          pass  # schema explicitly provided
      elif self.metadata is not None:
          self.schema = self.metadata.schema
      elif self.common_metadata is not None:
          self.schema = self.common_metadata.schema
      else:
          self.schema = self.pieces[0].get_metadata(open_file).schema
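The fallback order proposed above can be tried in isolation with the following minimal sketch. It is not the pyarrow implementation: `resolve_schema` and `read_first_piece_schema` are hypothetical names, and the `SimpleNamespace` objects are stand-ins for the parsed `_metadata` and `_common_metadata` footers.

```python
from types import SimpleNamespace


def resolve_schema(schema, metadata, common_metadata, read_first_piece_schema):
    """Hypothetical helper: pick a dataset schema using the proposed order."""
    if schema is not None:
        return schema  # schema explicitly provided
    if metadata is not None:
        return metadata.schema  # from `_metadata`
    if common_metadata is not None:
        return common_metadata.schema  # from `_common_metadata`
    # Last resort: read the footer of the first data file.
    return read_first_piece_schema()


# Stand-in footers; real code would hold parsed Parquet metadata objects.
common = SimpleNamespace(schema="schema-from-common-metadata")
full = SimpleNamespace(schema="schema-from-metadata")

only_common = resolve_schema(None, None, common, lambda: "schema-from-piece")
both = resolve_schema(None, full, common, lambda: "schema-from-piece")
neither = resolve_schema(None, None, None, lambda: "schema-from-piece")
```

Passing the first-piece read as a callable keeps that file access lazy, so it only happens when neither sidecar file is available.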

      If these changes are valid, I'd be happy to submit a PR. The difference between `_common_metadata` and `_metadata` is not 100% clear to me, but I believe the schema in both should be the same. I figured I'd open this for discussion.

              People

              • Assignee: Unassigned
              • Reporter: Jim Crist (jim.crist)