[ARROW-16546] [Python] Pyarrow fails to load parquet file with long column names - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 8.0.0
Fix Version/s: 9.0.0
Component/s: C++, Parquet, Python
Labels:
- pull-request-available
Environment:
Ubuntu 20.04, pandas 1.4.2

External issue URL:
https://github.com/apache/arrow/issues/31907

Description

Loading a parquet file raises "OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit". This seems to be related to the memory usage of the table header. The issue may be coming from the C code part. Note that pyarrow 0.16 is able to read the same parquet file.
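One way to check how large the file's header metadata actually is, without deserializing it, is to read the footer length directly. The sketch below relies on the Parquet file layout, in which an unencrypted file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The function name `parquet_footer_length` is hypothetical and is not part of pyarrow.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def parquet_footer_length(path):
    """Return the size in bytes of the serialized footer metadata.

    Hypothetical helper: it only inspects the last 8 bytes of the file
    (4-byte little-endian length + 4-byte magic) and never parses Thrift.
    """
    if os.path.getsize(path) < 8:
        raise ValueError("file is too small to be a Parquet file")
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

Calling `parquet_footer_length("long_cols.parquet")` on the file written by the snippet below would report how many bytes of metadata the reader has to deserialize, which can then be compared against the same figure for a file with short column names.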

Below is a code snippet to reproduce the issue. A screenshot of the Jupyter notebook with more details is in the attachments.

The code snippet creates two pandas dataframes which differ only in their column names. The one with short column names is stored and read without problems, while the one with long column names is stored successfully but raises an exception when read.

import pandas as pd
import numpy as np

data = np.random.randn(10, 250000)
index = range(10)
short_column_names = [f"col_{i}" for i in range(250000)]
long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(250000)]

# Identical dataframes; only the column names differ
df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)

# Storing dataframe with long column names works OK but reading fails
df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow") # Storing works
df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow") # <--- Fails here
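
For readers on a pyarrow release that includes the fix, one possible workaround is to raise the Thrift deserialization limits explicitly when reading. This is a minimal sketch, assuming the installed pyarrow (taken here to be 9.0.0 or later, matching the Fix Version above) accepts the `thrift_string_size_limit` and `thrift_container_size_limit` keyword arguments on `pyarrow.parquet.read_table`. The helper name and the default limit values are illustrative assumptions, not recommendations from the issue.

```python
def read_parquet_with_raised_thrift_limits(
    path,
    string_limit=2**30,          # assumed value: max bytes for one Thrift string
    container_limit=10_000_000,  # assumed value: max elements in one Thrift container
):
    """Read a Parquet file into a pandas DataFrame with larger Thrift limits.

    Hypothetical helper; pyarrow is imported lazily so that merely defining
    the function does not require pyarrow to be installed.
    """
    import pyarrow.parquet as pq

    table = pq.read_table(
        path,
        thrift_string_size_limit=string_limit,
        thrift_container_size_limit=container_limit,
    )
    return table.to_pandas()
```

With this helper, `read_parquet_with_raised_thrift_limits("long_cols.parquet")` would stand in for the failing `pd.read_parquet` call above. Whether the chosen limits are large enough depends on the actual metadata size of the file.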

Attachments

Screenshot from 2022-05-12 16-59-10.png
12/May/22 14:59
867 kB
Boris Urman

Issue Links

links to

GitHub Pull Request #13275

People

Assignee: Antoine Pitrou

Reporter: Boris Urman

Votes: 0

Watchers: 4

Dates

Created: 12/May/22 15:02

Updated: 11/Jan/23 11:44

Resolved: 07/Jun/22 11:29

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3.5h