Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
Right now, the statistics of the RowGroupInfo of ParquetFileFragments are only available when the dataset was constructed from a _metadata file:
import pandas as pd df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)}) # use dask to write partitioned dataset *with* _metadata file import dask.dataframe as dd ddf = dd.from_pandas(df, npartitions=2) ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow") import pyarrow.dataset as ds dataset_no_metadata = ds.dataset("test_dataset/", format="parquet", partitioning="hive") dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata", partitioning="hive")
In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>] In [32]: list(dataset_from_metadata.get_fragments())[0].row_groups[0].statistics Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}}
For some applications (eg dask), one could want access to those statistics, even if the original dataset / fragments were not created from a _metadata file. This should not happen automatically since it's costly, but a method to trigger collecting all metadata would be useful.
cc rjzamora
Attachments
Issue Links
- links to