Apache Arrow / ARROW-4076

[Python] schema validation and filters

    Details

      Description

      Currently, ParquetDataset validates schemas before it applies filters. This can raise a ValueError when the schema differs in some dataset pieces, even if those pieces would subsequently be filtered out. I think validation should happen after filtering to prevent such spurious errors:

      --- a/pyarrow/parquet.py	
      +++ b/pyarrow/parquet.py	
      @@ -878,13 +878,13 @@
               if split_row_groups:
                   raise NotImplementedError("split_row_groups not yet implemented")
       
      -        if validate_schema:
      -            self.validate_schemas()
      -
               if filters is not None:
                   filters = _check_filters(filters)
                   self._filter(filters)
       
      +        if validate_schema:
      +            self.validate_schemas()
      +
           def validate_schemas(self):
               open_file = self._get_open_file_func()
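
      To illustrate why the ordering matters, here is a minimal, self-contained sketch. It does not use pyarrow: `Piece`, `load_pieces`, the `validate_first` flag, and the example paths are hypothetical stand-ins invented for this illustration, and a schema is modelled as a plain tuple of column names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for one dataset piece (not a pyarrow class)."""
    path: str
    partition: dict   # partition key -> value, e.g. {"year": 2018}
    schema: tuple     # column names, standing in for a real schema


def validate_schemas(pieces):
    """Raise ValueError if any piece's schema differs from the first one."""
    if not pieces:
        return
    expected = pieces[0].schema
    for piece in pieces[1:]:
        if piece.schema != expected:
            raise ValueError(
                f"Schema in {piece.path} differs from {pieces[0].path}"
            )


def load_pieces(pieces, predicate=None, validate_schema=True,
                validate_first=False):
    """Select pieces; validate_first=True mimics the ordering reported here."""
    if validate_schema and validate_first:
        validate_schemas(pieces)
    selected = [p for p in pieces if predicate is None or predicate(p)]
    if validate_schema and not validate_first:
        validate_schemas(selected)
    return selected


pieces = [
    Piece("year=2017/a.parquet", {"year": 2017}, ("id", "value")),
    Piece("year=2018/b.parquet", {"year": 2018}, ("id", "value", "extra")),
]


def only_2018(piece):
    return piece.partition["year"] == 2018


# Filter, then validate: the 2017 piece is dropped before validation runs.
selected = load_pieces(pieces, only_2018)

# Validate, then filter: the mismatch raises although 2017 is filtered out.
try:
    load_pieces(pieces, only_2018, validate_first=True)
    raised = False
except ValueError:
    raised = True
```

      With validation first, the 2017 piece triggers the error even though the filter excludes it; with filtering first, only the surviving pieces are compared.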
      

              People

              • Assignee: Unassigned
              • Reporter: George Sakkis (gsakkis)

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m