  Apache Arrow / ARROW-10778

[Python] RowGroupInfo.statistics errors for empty row group

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0
    • Component/s: Python
    • Environment: Nightly pyarrow conda package on Ubuntu 18.04

    Description

      Accessing the `statistics` property of a `RowGroupInfo` object raises an `AttributeError` if the corresponding row group is empty. I would expect the property to return `None` (or an empty statistics structure) in this case.

      Reproducer:

       

      import pandas as pd
      import pyarrow.dataset as ds

      path0 = "test.parquet"
      path1 = "test.empty.parquet"
      df = pd.DataFrame({"a": ["a", "b", "b"], "b": [4, 5, 6]})
      # Write one file with three rows and one with the same schema but no rows.
      df.to_parquet(path0, engine="pyarrow")
      df[:0].to_parquet(path1, engine="pyarrow")
      # Statistics of the first row group of the populated file: works.
      rg = ds.dataset(path0).get_fragments().__next__().row_groups[0]
      print("Populated Row Group Statistics:", rg.statistics)
      # Same access on the empty file: raises AttributeError.
      empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
      print("Empty Row Group Statistics:", empty_rg.statistics)
      

      Output:

      Populated Row Group Statistics: {'a': {'min': 'a', 'max': 'b'}, 'b': {'min': 4, 'max': 6}}
      ---------------------------------------------------------------------------
      AttributeError                            Traceback (most recent call last)
      <ipython-input-1-57ba8b32c7e5> in <module>()
           13
           14 empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
      ---> 15 print("Empty Row Group Statistics:", empty_rg.statistics)

      /home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics()

      /home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics.name_stats()

      AttributeError: 'NoneType' object has no attribute 'has_min_max'
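
Calling code that hits this error could guard the access and get the `None` result expected above. The following is only a minimal sketch of such a guard, not the fix applied in Arrow. `safe_statistics` and the two stand-in classes are hypothetical names introduced for illustration; the only thing assumed about the real `RowGroupInfo` is that its `statistics` property may raise the `AttributeError` shown in the traceback.

```python
def safe_statistics(row_group):
    """Return row_group.statistics, or None if the lookup fails.

    Hypothetical helper: it only assumes a `statistics` attribute that may
    raise AttributeError for an empty row group, as in the traceback above.
    """
    try:
        return row_group.statistics
    except AttributeError:
        return None


# Stand-ins for RowGroupInfo so the sketch runs without pyarrow installed.
class _PopulatedRowGroup:
    statistics = {"a": {"min": "a", "max": "b"}, "b": {"min": 4, "max": 6}}


class _EmptyRowGroup:
    @property
    def statistics(self):
        # Mimics the failure mode reported for an empty row group.
        raise AttributeError("'NoneType' object has no attribute 'has_min_max'")


populated_stats = safe_statistics(_PopulatedRowGroup())
empty_stats = safe_statistics(_EmptyRowGroup())
```

One trade-off of this approach: catching `AttributeError` also hides the case where the object has no `statistics` attribute at all, so it is probably best treated as a stopgap rather than a long-term solution.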

       

          People

            Assignee: rjzamora Rick Zamora
            Reporter: rjzamora Rick Zamora
            Votes: 0
            Watchers: 3

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 0.5h