  1. Apache Arrow
  2. ARROW-9321

[C++][Dataset] Allow to "collect" statistics for ParquetFragment row groups if not constructed from _metadata


Details

    Description

      Right now, the statistics of the RowGroupInfo of ParquetFileFragments are only available when the dataset was constructed from a _metadata file:

      import pandas as pd
      import dask.dataframe as dd
      import pyarrow.dataset as ds

      df = pd.DataFrame({"part": ['A', 'A', 'B', 'B'], "col": range(4)})
      # use dask to write partitioned dataset *with* _metadata file
      ddf = dd.from_pandas(df, npartitions=2)
      ddf.to_parquet("test_dataset", partition_on=["part"], engine="pyarrow")

      # open the same files without and with the _metadata file
      dataset_no_metadata = ds.dataset("test_dataset/", format="parquet", partitioning="hive")
      dataset_from_metadata = ds.parquet_dataset("test_dataset/_metadata", partitioning="hive")
      
      
      In [28]: list(dataset_no_metadata.get_fragments())[0].row_groups                                                                                                                                                   
      
      In [30]: list(dataset_from_metadata.get_fragments())[0].row_groups                                                                                                                                                 
      Out[30]: [<pyarrow._dataset.RowGroupInfo at 0x7fd7882c0030>]
      
      In [32]: list(dataset_from_metadata.get_fragments())[0].row_groups[0].statistics                                                                                                                                   
      Out[32]: {'col': {'min': 2, 'max': 3}, 'index': {'min': 2, 'max': 3}}
      

      For some applications (e.g. dask), it would be useful to have access to those statistics even if the original dataset / fragments were not created from a _metadata file. This should not happen automatically, since it requires reading the footer of every file and is therefore costly, but a method to explicitly trigger collecting all metadata would be useful.

      cc rjzamora

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 3h 50m