Type: New Feature
Affects Version/s: None
Fix Version/s: None
I tried to use row_group filtering at the file level with an instance of ParquetDataset, without success.
I've tested the workaround proposed here:
But I wonder if it can work on a file as I get an exception with the following code:
I read the documentation, and the filtering seems to work only on partitioned datasets. I also found some related information in the following JIRA ticket: ARROW-1796
So I'm not sure whether a ParquetDataset can use row_group statistics to filter specific row_groups within a file (whether or not the file is part of a dataset).
As mentioned in ARROW-1796, I tried with fastparquet and, after fixing a bug (statistics.min instead of statistics.min_value), I was able to apply the row_group filtering.
Today, with pyarrow, I'm forced to filter the row_groups manually in each file, which prevents me from using the ParquetDataset partition filtering functionality.
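For reference, the manual per-file filtering I mean looks roughly like the sketch below. It is only an illustration, not the exact code I run: the function names, the column argument and the closed range [lower, upper] are hypothetical, and it assumes a flat (non-nested) schema whose column carries min/max statistics.

```python
def row_group_may_match(stat_min, stat_max, lower, upper):
    """Return True if the range [stat_min, stat_max] overlaps [lower, upper]."""
    return stat_max >= lower and stat_min <= upper


def read_matching_row_groups(path, column, lower, upper):
    """Read only the row groups whose statistics overlap [lower, upper].

    Hypothetical helper: assumes a flat schema so that the position of
    `column` in the Parquet schema is also its column-chunk index.
    """
    # Imported here so the pure helper above can be used without pyarrow.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    meta = pf.metadata
    col_idx = meta.schema.names.index(column)

    selected = []
    for i in range(meta.num_row_groups):
        stats = meta.row_group(i).column(col_idx).statistics
        # Keep the row group when statistics are missing: it cannot be ruled out.
        if stats is None or not stats.has_min_max:
            selected.append(i)
        elif row_group_may_match(stats.min, stats.max, lower, upper):
            selected.append(i)

    if not selected:
        # No candidate row group: return an empty table with the file's schema.
        return pf.schema_arrow.empty_table()
    return pf.read_row_groups(selected)
```

This only skips row groups that cannot contain matching rows; the rows of the selected row groups still have to be filtered afterwards, and the whole thing has to be repeated for every file.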
Row groups are really useful because they avoid filling the filesystem with many small files.