Details
- Bug
- Status: Resolved
- Critical
- Resolution: Fixed
- 8.0.0
- Ubuntu 20.04, pandas 1.4.2
Description
Loading the parquet file raises "OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit". This seems to be related to the memory used by the table header, and the issue may originate in the C code. Note that pyarrow 0.16 is able to read the same parquet file.
Below is a code snippet that reproduces the issue. A screenshot of the Jupyter notebook with more details is in the attachments.
The snippet creates two pandas dataframes that differ only in their column names. The one with short column names is stored and read without problems, while the one with long column names is stored successfully but raises an exception when read back.
import pandas as pd
import numpy as np

data = np.random.randn(10, 250000)
index = range(10)
short_column_names = [f"col_{i}" for i in range(250000)]
long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(250000)]

# Identical dataframes, only the column names are different
df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)

# Storing the dataframe with long column names works OK but reading fails
df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here
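To get a feel for why the column names alone might matter, here is a rough sketch that sums the UTF-8 byte length of the two sets of names from the snippet above. The helper `total_name_bytes` is a hypothetical name introduced only for this illustration, and the result is an estimate of the raw name text, not of the actual parquet footer size. The file metadata may well store each name more than once, so treat the figure as a lower bound.

```python
# Estimate how many bytes the column names themselves contribute.
# This measures only the raw name text, not the real parquet metadata.
N_COLUMNS = 250_000


def total_name_bytes(template: str, n: int = N_COLUMNS) -> int:
    """Return the summed UTF-8 length of template.format(i) for i in range(n)."""
    return sum(len(template.format(i).encode("utf-8")) for i in range(n))


short_bytes = total_name_bytes("col_{}")
long_bytes = total_name_bytes(
    "some_really_long_column_name_ending_with_integer_number_{}"
)

print(f"short names: {short_bytes / 1e6:.2f} MB")
print(f"long names:  {long_bytes / 1e6:.2f} MB")
print(f"ratio:       {long_bytes / short_bytes:.1f}x")
```

The two dataframes hold identical data, so any difference in metadata size comes from the names, which is consistent with the size-limit error appearing only for the long-name file.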
Attachments
Issue Links
- links to