Apache Arrow / ARROW-16431

[C++][Parquet] Improve error message in append_row_groups() when appending disjoint metadata

    Description

      Currently, if you try to append metadata from row groups with different schemas, you get the following error:

        File "/home/mmilton/.conda/envs/mmilton/envs/driverpipe/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py", line 52, in _append_row_groups
          metadata.append_row_groups(md)
        File "pyarrow/_parquet.pyx", line 628, in pyarrow._parquet.FileMetaData.append_row_groups
          self._metadata.AppendRowGroups(deref(c_metadata))
      RuntimeError: AppendRowGroups requires equal schemas.
      

      It would be more useful if the error reported the actual schema difference, i.e. which columns disagree. That information should be available on the error object and also appear in the error message.

      For example, the message could read:

      RuntimeError: AppendRowGroups requires equal schemas. Column "foo" was previously an int32 but the latest row group is storing it as an int64
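
A rough sketch of the requested behaviour, to make the proposal concrete. This is not how Arrow implements `AppendRowGroups`. Schemas are modelled here as plain dicts mapping column name to a type string, and `ColumnMismatch`, `SchemaMismatchError`, `diff_schemas` and `check_equal_schemas` are hypothetical names invented for this illustration. The point is that the exception carries the per-column differences as data and also renders them into its message.

```python
from typing import Dict, List, NamedTuple, Optional


class ColumnMismatch(NamedTuple):
    """One disagreeing column (hypothetical type, not part of pyarrow)."""
    column: str
    previous: Optional[str]  # None means the column was absent before
    latest: Optional[str]    # None means the column is absent now


def _describe(m: ColumnMismatch) -> str:
    # Turn a single mismatch into a human-readable sentence.
    if m.previous is None:
        return f'Column "{m.column}" is new in the latest row group ({m.latest}).'
    if m.latest is None:
        return f'Column "{m.column}" ({m.previous}) is missing from the latest row group.'
    return (f'Column "{m.column}" was previously {m.previous} but the latest '
            f'row group stores it as {m.latest}.')


class SchemaMismatchError(RuntimeError):
    """Hypothetical error that exposes the schema difference as an attribute."""

    def __init__(self, mismatches: List[ColumnMismatch]):
        self.mismatches = list(mismatches)
        details = " ".join(_describe(m) for m in self.mismatches)
        super().__init__(f"AppendRowGroups requires equal schemas. {details}")


def diff_schemas(previous: Dict[str, str], latest: Dict[str, str]) -> List[ColumnMismatch]:
    # Visit columns in a stable order: existing columns first, then new ones.
    names = list(previous) + [n for n in latest if n not in previous]
    return [ColumnMismatch(n, previous.get(n), latest.get(n))
            for n in names if previous.get(n) != latest.get(n)]


def check_equal_schemas(previous: Dict[str, str], latest: Dict[str, str]) -> None:
    # Raise only when at least one column disagrees.
    mismatches = diff_schemas(previous, latest)
    if mismatches:
        raise SchemaMismatchError(mismatches)
```

Because `SchemaMismatchError` subclasses `RuntimeError`, callers that already catch `RuntimeError` would keep working, while callers such as dask could inspect `err.mismatches` directly. Note that this dict-based model ignores column order, which a real schema comparison may also need to consider.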
      

            People

              milesgranger Miles Granger
              multimeric Michael Milton