Apache Arrow / ARROW-13655

[C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14


Details

    Description

      From https://github.com/dask/dask/issues/8027

      Apache Thrift introduced a `MaxMessageSize` configuration option (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize) in version 0.14 (THRIFT-5237).

      I think this is the cause of an issue originally reported at https://github.com/dask/dask/issues/8027, where one can get an "OSError: Couldn't deserialize thrift: MaxMessageSize reached" error while reading a large (metadata-only) Parquet file.

      In the original report, the file was written using the Python fastparquet library (which uses the Python Thrift bindings, which still use Thrift 0.13), but I was able to construct a reproducible example with pyarrow alone.

      First, create a Parquet file with large metadata using pyarrow in an environment where Arrow is built against Thrift 0.13 (e.g. a local build from source, or pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13):

      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq
      
      # Write a small table with 1000 columns, then read its metadata twice
      table = pa.table({str(i): np.random.randn(10) for i in range(1_000)})
      pq.write_table(table, "__temp_file_for_metadata.parquet")
      metadata = pq.read_metadata("__temp_file_for_metadata.parquet")
      metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet")
      
      # Append the same row group 4000 times to inflate the metadata size
      for _ in range(4000):
          metadata.append_row_groups(metadata2)
      metadata.write_metadata_file("test_parquet_metadata_large_file.parquet")
      

      Reading this file back in the same environment works fine, but reading it in an environment with the more recent Thrift 0.14 (e.g. the latest pyarrow from conda-forge) gives the following error:

      In [1]: import pyarrow.parquet as pq
      
      In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet")
      ...
      OSError: Couldn't deserialize thrift: MaxMessageSize reached
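
      To check whether a given file is likely to hit this limit, one can look at the size of its Thrift-serialized footer. The sketch below relies on the Parquet file layout, where an unencrypted file ends with a 4-byte little-endian footer length followed by the magic bytes "PAR1". The helper names (`parquet_footer_length`, `exceeds_assumed_limit`) are hypothetical, and the 100 MB value is an assumption about Thrift's default `MaxMessageSize` that should be verified against the Thrift build in use.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"

# Assumption: Thrift's default MaxMessageSize is 100 MB.
# Verify this against the Thrift version you actually build against.
ASSUMED_THRIFT_MAX_MESSAGE_SIZE = 100 * 1024 * 1024


def parquet_footer_length(path):
    """Return the footer (metadata) length in bytes of a Parquet file.

    Reads only the last 8 bytes: a 4-byte little-endian length
    followed by the "PAR1" magic. Encrypted footers are not handled.
    """
    if os.path.getsize(path) < 8:
        raise ValueError(f"{path!r} is too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path!r} does not end with the PAR1 magic bytes")
    (length,) = struct.unpack("<I", tail[:4])
    return length


def exceeds_assumed_limit(path, limit=ASSUMED_THRIFT_MAX_MESSAGE_SIZE):
    """Return True if the footer is larger than the given size limit."""
    return parquet_footer_length(path) > limit
```

      For example, calling `exceeds_assumed_limit("test_parquet_metadata_large_file.parquet")` on the file written above would indicate whether its footer is larger than the assumed default limit.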
      


            People

              apitrou Antoine Pitrou
              jorisvandenbossche Joris Van den Bossche


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h 10m