Apache Arrow / ARROW-2503

[Python] Trailing space character in RowGroup statistics of pyarrow.parquet.ParquetFile

      Description

      When reading a Parquet file containing a string column, the RowGroup statistics for that column contain a trailing space character. The example below shows the behavior.

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      # create and write arrow table as parquet
      df = pd.DataFrame({'string_column': ['some', 'string', 'values', 'here']})
      table = pa.Table.from_pandas(df)
      pq.write_table(table, 'example.parquet')
      
      # read parquet file metadata and print string column statistics
      pq_file = pq.ParquetFile('example.parquet')  # pass the path so no file handle is left open
      print(pq_file.metadata.row_group(0).column(0).statistics.max) # yields b'values '
      print(pq_file.metadata.row_group(0).column(0).statistics.min) # yields b'here '
      

      I did not observe this problem for other data types, even though the statistics are always returned as strings.

      When reading the same file with fastparquet, there is no trailing space character, which suggests that the problem lies in the read path of pyarrow.parquet. This might well be an issue in parquet-cpp, but since I encountered it as a pyarrow user, I am reporting it here.

      I'll try to investigate this further and report back here.

       
      Update:

      The trailing space is added in parquet-cpp: pyarrow calls the function FormatStatValue, which appends it (https://github.com/apache/parquet-cpp/blob/master/src/parquet/types.cc#L52). There is no comment there explaining why. Does anyone here know the reason?
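
      Until this is resolved upstream, a user-side workaround could look like the sketch below. The helper name strip_formatted_stat is hypothetical and not part of pyarrow. It assumes that exactly one space is appended to string statistics, as in the output reported above; on a version without this behavior it would wrongly remove a genuine trailing space.

```python
def strip_formatted_stat(value: bytes) -> bytes:
    """Remove one trailing space from a formatted string statistic.

    Hypothetical helper, not part of pyarrow. Assumes the formatter
    appends exactly one space, as observed in the output above.
    """
    if value.endswith(b" "):
        # Drop only the last byte so that spaces belonging to the
        # actual data are preserved.
        return value[:-1]
    return value


# Values as reported in the description above
print(strip_formatted_stat(b"values "))  # b'values'
print(strip_formatted_stat(b"here "))    # b'here'
```

      Removing a single byte instead of calling rstrip() keeps any trailing whitespace that is part of the stored value.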

            People

            • Assignee:
              jneuffer Julius Neuffer
              Reporter:
              jneuff Julius Neuffer

                Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 2h 10m