Apache Arrow / ARROW-2503

[Python] Trailing space character in RowGroup statistics of pyarrow.parquet.ParquetFile

Details

    Description

      When reading a Parquet file that contains a string column, the RowGroup statistics for that column carry a trailing space character. The example below reproduces the behavior.

      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      # create and write arrow table as parquet
      df = pd.DataFrame({'string_column': ['some', 'string', 'values', 'here']})
      table = pa.Table.from_pandas(df)
      pq.write_table(table, 'example.parquet')
      
      # read parquet file metadata and print string column statistics
      pq_file = pq.ParquetFile('example.parquet')  # pass the path so no file handle is left open
      print(pq_file.metadata.row_group(0).column(0).statistics.max) # yields b'values '
      print(pq_file.metadata.row_group(0).column(0).statistics.min) # yields b'here '
      

      I did not observe this problem for other data types, even though the statistics are always returned as strings.

      Reading the same file with fastparquet shows no trailing space character, which suggests that the problem lies in the reading path of pyarrow.parquet. This may well be an issue in parquet-cpp, but since I hit the bug as a pyarrow user, I am reporting it here.
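The mismatch can be checked mechanically by comparing a reported statistic with the value computed from the data itself. The sketch below is only an illustration: `has_spurious_trailing_space` is a hypothetical helper (not part of pyarrow or fastparquet), and the reported byte strings are copied from the output shown in the example above rather than read from a file.

```python
def has_spurious_trailing_space(reported: bytes, actual: str) -> bool:
    """Return True if `reported` equals `actual` (UTF-8 encoded) plus exactly one space.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    return reported == actual.encode("utf-8") + b" "


# Column values from the example above.
values = ['some', 'string', 'values', 'here']

# Statistics as reported by pyarrow in the example above.
reported_max = b'values '
reported_min = b'here '

print(has_spurious_trailing_space(reported_max, max(values)))  # True
print(has_spurious_trailing_space(reported_min, min(values)))  # True
```

A statistic that matches the data exactly (for example `b'here'`) makes the helper return False, so the same check could be run against the values returned by another reader.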

      I'll try to investigate this further and report back here.

       
      Update:

      The trailing space is added in parquet-cpp. pyarrow calls the function FormatStatValue, which appends the trailing space (https://github.com/apache/parquet-cpp/blob/master/src/parquet/types.cc#L52). There is no comment there explaining it. Does anyone here know the reason?
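Until the reason is known, a caller could undo the padding on the Python side. The following is a minimal sketch, not a fix: it assumes that exactly one space is appended to every string statistic, which matches the output observed above but is not confirmed for other versions or types. `strip_stat_padding` is a hypothetical helper, not part of pyarrow.

```python
def strip_stat_padding(stat: bytes) -> bytes:
    """Remove at most one trailing space from a string statistic.

    Assumption: the formatter appends exactly one space, so a value that
    genuinely ends in a space arrives with two and keeps one after stripping.
    """
    if stat.endswith(b" "):
        return stat[:-1]
    return stat


# Values as reported in the example above.
print(strip_stat_padding(b'values '))  # b'values'
print(strip_stat_padding(b'here '))    # b'here'
```

Removing only a single space, rather than calling `rstrip()`, avoids discarding spaces that belong to the stored value. If the padding is ever absent, however, the helper would wrongly shorten a value that really ends in a space.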

          People

            jneuffer Julius Neuffer
            jneuff Julius Neuffer
            Votes: 0
            Watchers: 3

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h 10m