Apache Arrow / ARROW-1982

[Python] Return parquet statistics min/max as values instead of strings

    Details

      Description

      Currently, the `min` and `max` column statistics are returned as strings formatted from the physical type. This makes them awkward to use from Python, because each string has to be parsed back into the proper logical type. Observe:

      In [20]: import pandas as pd
      
      In [21]: df = pd.DataFrame({'a': [1, 2, 3],
          ...:                    'b': ['a', 'b', 'c'],
          ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
          ...:
      
      In [22]: df.to_parquet('temp.parquet', engine='pyarrow')
      
      In [23]: from pyarrow import parquet as pq
      
      In [24]: f = pq.ParquetFile('temp.parquet')
      
      In [25]: rg = f.metadata.row_group(0)
      
      In [26]: rg.column(0).statistics.min  # string instead of integer
      Out[26]: '1'
      
      In [27]: rg.column(1).statistics.min  # weird space added after value due to formatter
      Out[27]: 'a '
      
      In [28]: rg.column(2).statistics.min  # formatted as physical type (int) instead of logical (datetime)
      Out[28]: '662688000000'
      

      Since the type information is known, it should be possible to return these as Arrow values instead of strings.
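
To illustrate the parsing burden this currently places on users, here is a minimal sketch of the conversion a caller has to do by hand today. It is not pyarrow code: `parse_statistic` is a hypothetical helper, and it assumes the caller already knows each column's Parquet physical type (`INT64`, `BYTE_ARRAY`) and, for the timestamp column, that the values are milliseconds since the Unix epoch (`TIMESTAMP_MILLIS`), as in the session above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_statistic(raw, physical_type, logical_type=None):
    """Hypothetical helper: turn a formatted min/max string into a Python value.

    `physical_type` and `logical_type` are assumed to be supplied by the
    caller; only the three cases from the session above are handled.
    """
    if physical_type == "INT64":
        number = int(raw)
        if logical_type == "TIMESTAMP_MILLIS":
            # Assumption: the integer counts milliseconds since the Unix epoch.
            return EPOCH + timedelta(milliseconds=number)
        return number
    if physical_type == "BYTE_ARRAY":
        # Drop the single trailing space added by the formatter. This is
        # lossy: a value that really ends in a space cannot be told apart.
        return raw[:-1] if raw.endswith(" ") else raw
    raise ValueError(f"unsupported physical type: {physical_type}")


print(parse_statistic("1", "INT64"))
print(repr(parse_statistic("a ", "BYTE_ARRAY")))
print(parse_statistic("662688000000", "INT64", "TIMESTAMP_MILLIS"))
```

The string case shows why a fix inside the library is preferable: once the value has been formatted, the caller cannot recover it reliably.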

      People

      Assignee: Wes McKinney (wesmckinn)
      Reporter: Jim Crist (jim.crist)