Apache Arrow / ARROW-4139

[Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is set

    Details

      Description

      When writing Pandas data to Parquet format and reading it back again, I find that the statistics of text columns are stored as byte arrays rather than as unicode text.

      I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding of how best to manage statistics. (I'd be quite happy to learn that it was the latter).

      Here is a minimal example:

      import pandas as pd
      import pyarrow.parquet as pq

      # Write a one-row text column, then read back the file metadata
      df = pd.DataFrame({'x': ['a']})
      df.to_parquet('df.parquet')

      pf = pq.ParquetDataset('df.parquet')
      piece = pf.pieces[0]
      md = piece.get_metadata(pq.ParquetFile)
      rg = md.row_group(0)  # first (and only) row group
      c = rg.column(0)      # column chunk metadata for 'x'
      
      >>> c
      <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
        file_offset: 63
        file_path: 
        physical_type: BYTE_ARRAY
        num_values: 1
        path_in_schema: x
        is_stats_set: True
        statistics:
          <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
            has_min_max: True
            min: b'a'
            max: b'a'
            null_count: 0
            distinct_count: 0
            num_values: 1
            physical_type: BYTE_ARRAY
        compression: SNAPPY
        encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
        has_dictionary_page: True
        dictionary_page_offset: 4
        data_page_offset: 25
        total_compressed_size: 59
        total_uncompressed_size: 55
      
      >>> type(c.statistics.min)
      bytes
      

      My guess is that we would want to store a logical type in the statistics like UNICODE, though I don't have enough experience with Parquet data types to know if this is a good idea or possible.
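To make the requested behaviour concrete, here is a minimal sketch of the cast described in the title. It is not PyArrow's implementation: `decode_stat` is a hypothetical helper, and it assumes the column's converted type is available to the caller as a plain string such as 'UTF8' (how that string is obtained from the schema is left out here).

```python
def decode_stat(value, converted_type):
    """Return a statistics value as str when the column is annotated UTF8.

    Hypothetical helper for illustration only; `converted_type` is assumed
    to be the column's Parquet ConvertedType given as a plain string.
    """
    if isinstance(value, bytes) and converted_type == "UTF8":
        # BYTE_ARRAY statistics of a UTF8 column hold UTF-8 encoded text
        return value.decode("utf-8")
    # Any other physical or converted type is passed through unchanged
    return value


# Mirrors the values from the example above: min/max are b'a'
min_text = decode_stat(b"a", "UTF8")
max_text = decode_stat(b"a", "UTF8")
```

With a cast like this applied when the statistics are read, `type(c.statistics.min)` in the example above would be `str` instead of `bytes`, while plain binary columns would keep their byte values.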

              People

              • Assignee:
                wesm Wes McKinney
                Reporter:
                mrocklin Matthew Rocklin

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h 10m