Apache Arrow / ARROW-16546

[Python] Pyarrow fails to load parquet file with long column names


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 8.0.0
    • Fix Version/s: 9.0.0
    • Component/s: C++, Parquet, Python
    • Environment: Ubuntu 20.04, pandas 1.4.2

    Description

      Loading the parquet file raises "OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit". This seems to be related to the memory usage of the table header, and the issue may come from the C++ part of the code. pyarrow 0.16 is able to read the same parquet file.
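To give a sense of scale for the "table header" hypothesis, here is a rough, standard-library-only estimate of how many bytes the column names alone add up to. It assumes the 250000 columns and the two name prefixes from the reproduction. The helper `total_name_bytes` is a hypothetical name introduced for illustration. The real Parquet footer holds more than the bare names, so treat these figures as a lower bound rather than the actual metadata size.

```python
# Rough estimate of the bytes contributed by the column names alone.
# Assumption: names are built as f"{prefix}{i}", as in the reproduction.
N_COLUMNS = 250_000
SHORT_PREFIX = "col_"
LONG_PREFIX = "some_really_long_column_name_ending_with_integer_number_"


def total_name_bytes(prefix: str, n_columns: int) -> int:
    """Sum of the UTF-8 encoded lengths of f"{prefix}{i}" for i in range(n_columns)."""
    return sum(len(f"{prefix}{i}".encode("utf-8")) for i in range(n_columns))


short_total = total_name_bytes(SHORT_PREFIX, N_COLUMNS)
long_total = total_name_bytes(LONG_PREFIX, N_COLUMNS)

print(f"short names: {short_total / 1e6:.1f} MB")
print(f"long names:  {long_total / 1e6:.1f} MB")
```

The two totals differ by exactly the prefix length difference times the number of columns, which is why only the long-name file would cross a fixed size limit on string data.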

      Below is a code snippet that reproduces the issue. A screenshot of the Jupyter notebook with more details is in the attachments.

      The snippet creates two pandas dataframes that differ only in their column names. The one with short column names is stored and read back without problems, while the one with long column names is stored successfully but raises an exception when read.

      import pandas as pd
      import numpy as np
      
      data = np.random.randn(10, 250000)
      index = range(10)
      short_column_names = [f"col_{i}" for i in range(250000)]
      long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(250000)]
      
      df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
      df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)  # Identical dataframes; only the column names differ
      
      # Storing the dataframe with long column names works, but reading it back fails
      df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
      df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here

       

       


            People

              Assignee: Antoine Pitrou (apitrou)
              Reporter: Boris Urman (urmanbm)
              Votes: 0
              Watchers: 4


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3.5h