[ARROW-8208] [PYTHON] Row Group Filtering With ParquetDataset - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- dataset

External issue URL:
https://github.com/apache/arrow/issues/24405

Description

Hello,

I tried to use row_group filtering at the file level with an instance of ParquetDataset, without success.

I've tested the workaround proposed here:
https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883

But I wonder whether it can work on a single file, since I get an exception with the following code:

from pyarrow.parquet import ParquetDataset

ParquetDataset('data.parquet',
               filters=[('ticker', '=', 'AAPL')]).read().to_pandas()

AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'

I read the documentation, and the filtering seems to work only on partitioned datasets. I also found some related information in the following JIRA ticket: ~~ARROW-1796~~

So I'm not sure whether a ParquetDataset can use row_group statistics to filter specific row groups in a file (whether or not the file is part of a dataset).

As mentioned in ~~ARROW-1796~~, I tried fastparquet, and after fixing a bug (statistics.min instead of statistics.min_value), I was able to apply the row_group filtering.

Today, with pyarrow, I'm forced to filter the row groups manually in each file, which prevents me from using the ParquetDataset partition filtering functionality.
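
For illustration, here is a minimal sketch of what such manual filtering might look like for an equality predicate. It is not the reporter's actual code: the helper names (`row_groups_matching_eq`, `column_stats`, `read_equal`) are hypothetical. It assumes the file was written with min/max statistics and that the statistics values are comparable with the filter value (their Python type can depend on the column type and the pyarrow version).

```python
from typing import Any, List, Optional, Tuple

# (min, max) for one row group; None when no statistics were stored.
Stats = Optional[Tuple[Any, Any]]


def row_groups_matching_eq(stats: List[Stats], value: Any) -> List[int]:
    """Return indices of row groups whose [min, max] range may contain value."""
    keep = []
    for index, bounds in enumerate(stats):
        # Without statistics nothing can be ruled out, so keep the row group.
        if bounds is None or bounds[0] <= value <= bounds[1]:
            keep.append(index)
    return keep


def column_stats(metadata: Any, column: str) -> List[Stats]:
    """Collect (min, max) per row group for `column` from pyarrow FileMetaData."""
    result: List[Stats] = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        bounds: Stats = None
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            if chunk.path_in_schema == column:
                statistics = chunk.statistics
                if statistics is not None and statistics.has_min_max:
                    bounds = (statistics.min, statistics.max)
                break
        result.append(bounds)
    return result


def read_equal(path: str, column: str, value: Any):
    """Read only candidate row groups, then filter rows exactly (hypothetical helper)."""
    # Imported lazily so the helpers above have no third-party dependency.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    selected = row_groups_matching_eq(
        column_stats(parquet_file.metadata, column), value
    )
    if not selected:
        # No row group can match: return an empty frame with the file's schema.
        return parquet_file.schema_arrow.empty_table().to_pandas()
    frame = parquet_file.read_row_groups(selected).to_pandas()
    # Statistics only prune row groups; the exact row filter is still needed.
    return frame[frame[column] == value]
```

With this sketch, `read_equal('data.parquet', 'ticker', 'AAPL')` would cover the single-file case above; it has to be repeated for each file of a dataset, which is the manual work described here.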

Row groups are really useful because they avoid filling the filesystem with small files...

Issue Links

relates to

ARROW-3764 [C++] Port Python "ParquetDataset" business logic to C++

Resolved

People

Assignee: Unassigned

Reporter: Christophe Clienti

Votes: 0

Watchers: 4

Dates

Created: 25/Mar/20 10:49

Updated: 11/Jan/23 07:58

Resolved: 14/Apr/20 08:25