Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Environment: Nightly pyarrow conda package on Ubuntu 18.04
Description
Accessing the `statistics` property on a `RowGroupInfo` object raises an `AttributeError` if the corresponding row group is empty. I would expect this property to return `None` (or an empty statistics structure) in such cases.
Reproducer:
import pandas as pd
import pyarrow.dataset as ds

path0 = "test.parquet"
path1 = "test.empty.parquet"

df = pd.DataFrame({"a": ["a", "b", "b"], "b": [4, 5, 6]})
df.to_parquet(path0, engine="pyarrow")
df[:0].to_parquet(path1, engine="pyarrow")

rg = ds.dataset(path0).get_fragments().__next__().row_groups[0]
print("Populated Row Group Statistics:", rg.statistics)

empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
print("Empty Row Group Statistics:", empty_rg.statistics)
Output:
Populated Row Group Statistics: {'a': {'min': 'a', 'max': 'b'}, 'b': {'min': 4, 'max': 6}}
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-57ba8b32c7e5> in <module>()
     13
     14 empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
---> 15 print("Empty Row Group Statistics:", empty_rg.statistics)

/home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics()

/home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics.name_stats()

AttributeError: 'NoneType' object has no attribute 'has_min_max'
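Possible caller-side workaround:
On builds that still show this behaviour, the property access could be wrapped so that an empty row group yields `None`, which is the result expected above. This is only a sketch: `safe_statistics` is a hypothetical helper, not part of pyarrow, and it assumes the failure always surfaces as the `AttributeError` shown in the traceback.

```python
def safe_statistics(row_group):
    """Return ``row_group.statistics``, or ``None`` if it cannot be computed.

    Hypothetical helper (not a pyarrow API). Assumes the failure mode from
    the traceback above: accessing ``.statistics`` raises AttributeError
    when the row group carries no column statistics.
    """
    try:
        return row_group.statistics
    except AttributeError:
        # Affected builds raise here for empty row groups.
        return None
```

With the reproducer above, `safe_statistics(empty_rg)` would return `None` instead of raising, while `safe_statistics(rg)` returns the same dictionary as `rg.statistics`. Note that catching `AttributeError` this broadly also hides the case where the object has no `statistics` attribute at all, so the guard is best kept close to the affected call.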