Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Related to ARROW-9730: parsing the statistics in Parquet metadata is expensive and should therefore be avoided when possible.
For example, the ParquetDatasetFactory (ds.parquet_dataset() in Python) parses all statistics of all files and all columns. For a filtered read, however, you might only need the statistics of certain files (e.g. if a filter on a partition field has already excluded many files) and certain columns (e.g. only the columns you are actually filtering on).
The current API is rather all-or-nothing: both ParquetDatasetFactory and a later EnsureCompleteMetadata parse all statistics, and neither allows parsing only a subset of them or only the other (non-statistics) metadata. I think we should try to come up with better abstractions.
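To make the idea concrete, here is a rough sketch in plain Python of what a lazier abstraction could look like: statistics are parsed per file and per column only when a filter asks for them, and then cached. This is only an illustration under stated assumptions. `LazyFileStatistics`, `files_matching`, and `fake_parse` are hypothetical names that do not exist in Arrow, and the fake dictionary stands in for real Parquet footer metadata.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; none of these names exist in Arrow.
Stats = Tuple[int, int]  # (min, max) of one column in one file


@dataclass
class LazyFileStatistics:
    """Parses a column's statistics on first request and caches the result."""

    path: str
    # Stand-in for the expensive metadata parse: (path, column) -> (min, max).
    parse_column: Callable[[str, str], Stats]
    _cache: Dict[str, Stats] = field(default_factory=dict)

    def get(self, column: str) -> Stats:
        if column not in self._cache:
            self._cache[column] = self.parse_column(self.path, column)
        return self._cache[column]


def files_matching(files: List[LazyFileStatistics], column: str, value: int) -> List[str]:
    """Return paths of files whose [min, max] range for `column` may contain `value`."""
    matches = []
    for f in files:
        lo, hi = f.get(column)
        if lo <= value <= hi:
            matches.append(f.path)
    return matches


# Fake statistics standing in for real Parquet footer metadata (assumption).
FAKE_STATS = {
    "part-0.parquet": {"x": (0, 9), "y": (100, 199)},
    "part-1.parquet": {"x": (10, 19), "y": (200, 299)},
}
parsed = []  # records which (file, column) pairs were actually parsed


def fake_parse(path: str, column: str) -> Stats:
    parsed.append((path, column))
    return FAKE_STATS[path][column]


files = [LazyFileStatistics(p, fake_parse) for p in FAKE_STATS]
result = files_matching(files, "x", 12)
```

In this sketch, filtering on `x` only ever parses the `x` statistics, and the statistics of `y` are never touched. A file already excluded by a partition filter would simply never be passed to `files_matching`, so none of its statistics would be parsed at all.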