[ARROW-9321] [C++][Dataset] Allow to "collect" statistics for ParquetFragment row groups if not constructed from _metadata - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/17257

Description

Right now, the statistics of the RowGroupInfo of ParquetFileFragments are only available when the dataset was constructed from a _metadata file:

import pandas as pd
df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)})
# use dask to write partitioned dataset *with* _metadata file
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=2)
ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow")

import pyarrow.dataset as ds
dataset_no_metadata = ds.dataset("test_dataset/", format="parquet", partitioning="hive")
dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata", partitioning="hive")


In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups

In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups
Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>]

In [32]: list(dataset_from_metadata.get_fragments())[0].row_groups[0].statistics
Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}}

For some applications (e.g. dask), it is useful to have access to those statistics even when the original dataset or its fragments were not created from a _metadata file. Collecting them should not happen automatically, since it is costly, but a method to explicitly trigger collecting all metadata would be useful.

cc rjzamora

Attachments

Issue Links

links to

GitHub Pull Request #7692

Activity

People

Assignee: Ben Kietzman

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 03/Jul/20 14:47

Updated: 11/Jan/23 08:06

Resolved: 12/Jul/20 22:53

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3h 50m