Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
Description
When writing Pandas data to Parquet format and reading it back again, I find that the statistics of text columns are stored as byte arrays rather than as unicode text.
I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding of how best to manage statistics. (I'd be quite happy to learn that it was the latter).
Here is a minimal example:
import pandas as pd
df = pd.DataFrame({'x': ['a']})
df.to_parquet('df.parquet')

import pyarrow.parquet as pq
pf = pq.ParquetDataset('df.parquet')
piece = pf.pieces[0]
# open the piece's file metadata and inspect the only row group / column chunk
md = piece.get_metadata(pq.ParquetFile)
rg = md.row_group(0)
c = rg.column(0)

>>> c
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
  file_offset: 63
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 1
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
      has_min_max: True
      min: b'a'
      max: b'a'
      null_count: 0
      distinct_count: 0
      num_values: 1
      physical_type: BYTE_ARRAY
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 25
  total_compressed_size: 59
  total_uncompressed_size: 55

>>> type(c.statistics.min)
bytes
My guess is that we would want to store a logical type like UNICODE alongside the statistics, though I don't have enough experience with Parquet data types to know whether this is a good idea or even possible.
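In the meantime, here is a reader-side workaround sketch, not a pyarrow API: since the Arrow schema stored in the file records which columns are string-typed, the raw BYTE_ARRAY min/max can be decoded as UTF-8 when that is the case. The helper name decoded_min_max is hypothetical.

import pyarrow as pa
import pyarrow.parquet as pq

def decoded_min_max(path, row_group=0, column=0):
    # Hypothetical workaround helper (not part of pyarrow): return the min/max
    # statistics of one column chunk, decoded to unicode when the Arrow schema
    # embedded in the file marks that column as a string column.
    pf = pq.ParquetFile(path)
    arrow_schema = pf.schema.to_arrow_schema()
    stats = pf.metadata.row_group(row_group).column(column).statistics
    if stats is None or not stats.has_min_max:
        return None
    lo, hi = stats.min, stats.max
    if pa.types.is_string(arrow_schema[column].type) and isinstance(lo, bytes):
        # Parquet stores UTF8 data as BYTE_ARRAY, so the raw statistics come
        # back as bytes; decode them for a unicode view.
        lo, hi = lo.decode('utf-8'), hi.decode('utf-8')
    return lo, hi

For the df.parquet file above, this should give ('a', 'a') rather than (b'a', b'a').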
Issue Links
- relates to: ARROW-5166 [Python][Parquet] Statistics for uint64 columns may overflow (Resolved)