Apache Arrow / ARROW-4076

[Python] schema validation and filters

    Details

      Description

      Currently, schema validation of ParquetDataset takes place before filtering. This may raise a ValueError if some dataset pieces have a different schema, even when those pieces would subsequently be filtered out. I think validation should happen after filtering to prevent such spurious errors:

      --- a/pyarrow/parquet.py	
      +++ b/pyarrow/parquet.py	
      @@ -878,13 +878,13 @@
               if split_row_groups:
                   raise NotImplementedError("split_row_groups not yet implemented")
       
      -        if validate_schema:
      -            self.validate_schemas()
      -
               if filters is not None:
                   filters = _check_filters(filters)
                   self._filter(filters)
       
      +        if validate_schema:
      +            self.validate_schemas()
      +
           def validate_schemas(self):
               open_file = self._get_open_file_func()
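
      The effect of swapping the two steps can be illustrated with a small, self-contained sketch. It does not use pyarrow: `Piece`, `select`, and `load` are hypothetical stand-ins invented for this example, and `validate_schemas` here is a simplified imitation of the method named in the patch, assuming that validation compares every piece against the first one.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """Hypothetical stand-in for one dataset piece (not a pyarrow class)."""
    partition: dict   # e.g. {"year": 2018}
    schema: tuple     # e.g. (("a", "int64"),)


def validate_schemas(pieces):
    """Raise ValueError if any piece's schema differs from the first one."""
    if not pieces:
        return
    expected = pieces[0].schema
    for piece in pieces[1:]:
        if piece.schema != expected:
            raise ValueError(
                f"Schema in partition {piece.partition} differs from {expected}"
            )


def select(pieces, key, value):
    """Keep only the pieces whose partition key equals value."""
    return [p for p in pieces if p.partition.get(key) == value]


def load(pieces, key, value, validate_first):
    """Return the selected pieces, validating before or after filtering."""
    if validate_first:
        validate_schemas(pieces)       # current order: checks every piece
        return select(pieces, key, value)
    kept = select(pieces, key, value)
    validate_schemas(kept)             # proposed order: checks survivors only
    return kept


pieces = [
    Piece({"year": 2017}, (("a", "string"),)),  # older, incompatible schema
    Piece({"year": 2018}, (("a", "int64"),)),
    Piece({"year": 2018}, (("a", "int64"),)),
]
```

      With these pieces, `load(pieces, "year", 2018, validate_first=True)` raises a ValueError because of the 2017 piece, although that piece is not part of the result. With `validate_first=False` the same call returns the two 2018 pieces without error.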
      

            People

            • Assignee: Joris Van den Bossche (jorisvandenbossche)
            • Reporter: George Sakkis (gsakkis)
            • Votes: 0
            • Watchers: 3

                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1.5h