Apache Arrow / ARROW-11792

PyArrow unable to read file with large string values

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None
    • Environment: Scientific Linux 7.9; PyArrow 3.0.0, Pandas 1.0.5

    Description

      I am having difficulty re-reading a Parquet file written out using Pandas. The error message hints that either the file was malformed on write, or possibly that it is corrupt on disk (hard for me to confirm or deny that option - if there's an easy way for me to check, let me know).

      The original Pandas dataframe consisted of around 50 million rows with four columns. Three columns are simple `float` data, while the fourth is a string-typed column containing long strings, averaging 200 characters. Each string value is present in 20-30 rows, giving around 2 million unique strings. If this is a pyarrow issue, the string column is where my suspicion currently lies.

      The file was written out with df.to_parquet(compression="brotli").
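For anyone trying to reproduce this, the data shape described above can be approximated at a smaller scale. This is only a sketch under assumptions: the function name `make_frame`, the column names, and the random lowercase strings are all hypothetical stand-ins for the real data, and the default sizes are far smaller than the original 50 million rows.

```python
import numpy as np
import pandas as pd


def make_frame(n_unique=1_000, repeats=25, str_len=200, seed=0):
    """Build a frame with three float columns and one repeated long-string column.

    Hypothetical stand-in for the real data: each of `n_unique` random
    strings of length `str_len` appears in `repeats` consecutive rows.
    """
    rng = np.random.default_rng(seed)
    alphabet = np.array(list("abcdefghijklmnopqrstuvwxyz"))
    strings = [
        "".join(rng.choice(alphabet, size=str_len)) for _ in range(n_unique)
    ]
    n_rows = n_unique * repeats
    return pd.DataFrame(
        {
            "a": rng.random(n_rows),
            "b": rng.random(n_rows),
            "c": rng.random(n_rows),
            "text": np.repeat(strings, repeats),
        }
    )
```

A frame built this way can then be written the same way as the original, for example `make_frame().to_parquet("sample.parquet", compression="brotli")`, scaling `n_unique` up towards 2 million to approach the size at which the problem was seen.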

      As well as pyarrow 3.0.0, I have quickly tried 2.0.0 and 1.0.1, both of which also fail to read the file. Annoyingly, re-generating the data and writing it out takes several hours, and a test on a smaller dataset produces a readable file.

      I am able to read the metadata of the file with PyArrow, which looks as I expect. The full metadata is attached in JSON format.

      >>> pyarrow.parquet.read_metadata("builtenv_vulns_bad.parquet")
      <pyarrow._parquet.FileMetaData object at 0x7f8ae91f88e0>
      created_by: parquet-cpp version 1.5.1-SNAPSHOT
      num_columns: 4
      num_rows: 55761732
      num_row_groups: 1
      format_version: 1.0
      serialized_size: 3213

      I can provide the problematic file privately - it's around 250MB.

      [...snip...]
          df = pd.read_parquet(data_source, columns=columns)
        File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 312, in read_parquet
          return impl.read(path, columns=columns, **kwargs)
        File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 127, in read
          path, columns=columns, **kwargs
        File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1704, in read_table
          use_pandas_metadata=use_pandas_metadata)
        File "/home/farm/farmcatenv/lib64/python3.6/site-packages/pyarrow/parquet.py", line 1582, in read
          use_threads=use_threads
        File "pyarrow/_dataset.pyx", line 372, in pyarrow._dataset.Dataset.to_table
        File "pyarrow/_dataset.pyx", line 2266, in pyarrow._dataset.Scanner.to_table
        File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
        File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
      Deserializing page header failed.

      Attachments

        1. metadata.json
          5 kB
          Daniel Evans

          People

            Assignee: Unassigned
            Reporter: Daniel Evans