Apache Arrow / ARROW-8694

[Python][Parquet] parquet.read_schema() fails when loading wide table created from Pandas DataFrame


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.17.1, 1.0.0
    • Component/s: C++, Python
    • Environment: Linux OS with RHEL 7.7 distribution

    Description

      parquet.read_schema() fails when reading the schema of a wide Parquet file written from a pandas DataFrame with 50,000 columns. The same code works with pyarrow 0.16.0.

      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq
      print(pa.__version__)
      df = pd.DataFrame({'c' + str(i): np.random.randn(10) for i in range(50000)})
      table = pa.Table.from_pandas(df)
      pq.write_table(table, "test_wide.parquet")
      schema = pq.read_schema('test_wide.parquet')

      Output:

      0.17.0
      Traceback (most recent call last):
      File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-29-d5ef2df77263>", line 9, in <module>
      table = pq.read_schema('test_wide.parquet')
      File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py", line 1793, in read_schema
      return ParquetFile(where, memory_map=memory_map).schema.to_arrow_schema()
      File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__
      read_dictionary=read_dictionary, metadata=metadata)
      File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open
      File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
      OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit

       

            People

              Assignee: Wes McKinney (wesm)
              Reporter: Eric Kisslinger (ekisslinger)


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h