Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
0.9.0
Description
When reading a Parquet file that contains a string column, the RowGroup statistics (min and max) for that column contain a trailing space character. The example below shows the behavior.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# create and write arrow table as parquet
df = pd.DataFrame({'string_column': ['some', 'string', 'values', 'here']})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')

# read parquet file metadata and print string column statistics
pq_file = pq.ParquetFile(open('example.parquet', 'rb'))
print(pq_file.metadata.row_group(0).column(0).statistics.max)  # yields b'values '
print(pq_file.metadata.row_group(0).column(0).statistics.min)  # yields b'here '
For other data types I did not observe this problem, even though the statistics are always strings.
When reading the same file with fastparquet, there is no trailing space character, which implies that this problem occurs in the reading path of pyarrow.parquet. I am aware that this might well be an issue with parquet-cpp, but as I face this bug as a pyarrow user, I report it here.
I'll try to investigate this further and report back here.
Update:
The trailing space is added in parquet-cpp. pyarrow calls the function FormatStatValue which adds the trailing space (https://github.com/apache/parquet-cpp/blob/master/src/parquet/types.cc#L52). There is no comment there to explain it. Does anyone here know what the reason is?
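As a stopgap on the user side, something like the following could strip the extra character from the values pyarrow returns. This is only a sketch: `strip_stat_padding` is a hypothetical helper, not part of pyarrow or parquet-cpp, and it assumes that exactly one space is appended. It therefore cannot tell the added space apart from a string value that genuinely ends in a space.

```python
def strip_stat_padding(value: bytes) -> bytes:
    """Remove a single trailing space from a statistics value, if present.

    Hypothetical helper: assumes the formatter appends exactly one space.
    """
    if value.endswith(b" "):
        return value[:-1]
    return value


# Values as reported by the example in this issue
reported_max = b"values "
reported_min = b"here "

cleaned_max = strip_stat_padding(reported_max)
cleaned_min = strip_stat_padding(reported_min)
print(cleaned_max, cleaned_min)
```

Only one byte is removed on purpose, so a value with several trailing spaces loses just the last one.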