Apache Arrow / ARROW-16287

PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.0.0
    • Fix Version/s: None
    • Component/s: Parquet
    • Labels: None
    • Environment: MacOS. Python 3.8.10.
      pyarrow: '7.0.0'
      pandas: '1.4.2'
      numpy: '1.22.3'

    Description

      I'm trying to follow the example at https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files to write a partitioned dataset, but I consistently get an error about unequal schemas. Here's an MCVE:

      from pathlib import Path
      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      size = 100_000_000
      partition_col = np.random.randint(0, 10, size)
      values = np.random.rand(size)
      table = pa.Table.from_pandas(
          pd.DataFrame({"partition_col": partition_col, "values": values})
      )
      metadata_collector = []
      root_path = Path("random.parquet")
      pq.write_to_dataset(
          table,
          root_path,
          partition_cols=["partition_col"],
          metadata_collector=metadata_collector,
      )
      
      # Write the ``_common_metadata`` parquet file without row groups statistics
      pq.write_metadata(table.schema, root_path / "_common_metadata")

      # Write the ``_metadata`` parquet file with row groups statistics of all files
      pq.write_metadata(
          table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      )

      This raises the following error:

      ---------------------------------------------------------------------------
      RuntimeError                              Traceback (most recent call last)
      Input In [92], in <cell line: 1>()
      ----> 1 pq.write_metadata(
            2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
            3 )
      File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
         2322 metadata = read_metadata(where)
         2323 for m in metadata_collector:
      -> 2324     metadata.append_row_groups(m)
         2325 metadata.write_metadata_file(where)
      File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()
      RuntimeError: AppendRowGroups requires equal schemas. 

      But all schemas in the `metadata_collector` list seem to be the same:

      all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
      # True 

      People

        Assignee: Unassigned
        Reporter: Kyle Barron (kylebarron2)