Apache Arrow / ARROW-1982

[Python] Return parquet statistics min/max as values instead of strings

Details

    Description

      Currently, the `min` and `max` column statistics are returned as formatted strings of the physical type. This makes them awkward to use from Python, because each string has to be parsed back into the proper logical type. Observe:

      In [20]: import pandas as pd
      
      In [21]: df = pd.DataFrame({'a': [1, 2, 3],
          ...:                    'b': ['a', 'b', 'c'],
          ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
          ...:
      
      In [22]: df.to_parquet('temp.parquet', engine='pyarrow')
      
      In [23]: from pyarrow import parquet as pq
      
      In [24]: f = pq.ParquetFile('temp.parquet')
      
      In [25]: rg = f.metadata.row_group(0)
      
      In [26]: rg.column(0).statistics.min  # string instead of integer
      Out[26]: '1'
      
      In [27]: rg.column(1).statistics.min  # weird space added after value due to formatter
      Out[27]: 'a '
      
      In [28]: rg.column(2).statistics.min  # formatted as physical type (int) instead of logical (datetime)
      Out[28]: '662688000000'
      

      Since the type information is known, it should be possible to convert these to Arrow values instead of strings.
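
      For illustration, the parsing that callers currently have to do by hand might look like the sketch below. The helper `parse_stat` and its type labels are hypothetical and are not part of pyarrow. It assumes the timestamp statistic is milliseconds since the Unix epoch (which matches the value shown above) and that the formatter appends exactly one trailing space to string values.

```python
from datetime import datetime, timedelta

# Naive UTC epoch; statistics for timestamps are assumed to be offsets from it.
EPOCH = datetime(1970, 1, 1)


def parse_stat(raw, logical_type):
    """Hypothetical helper: turn a formatted statistics string into a Python value.

    `logical_type` is an illustrative label chosen for this sketch,
    not a pyarrow type object.
    """
    if logical_type == "int64":
        return int(raw)
    if logical_type == "string":
        # Assumption: the formatter appends one trailing space; drop only that one.
        return raw[:-1] if raw.endswith(" ") else raw
    if logical_type == "timestamp[ms]":
        # Assumption: the physical value is milliseconds since the epoch.
        return EPOCH + timedelta(milliseconds=int(raw))
    raise ValueError("unsupported logical type: {!r}".format(logical_type))


print(parse_stat("1", "int64"))
print(parse_stat("a ", "string"))
print(parse_stat("662688000000", "timestamp[ms]"))
```

      Having the library return typed values directly would make this kind of per-type guessing unnecessary, in particular the ambiguous trailing-space handling for strings.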

            People

              wesm Wes McKinney
              jim.crist Jim Crist