Apache Arrow / ARROW-10056

[C++] Increase flatbuffers max_tables parameter in order to read wide tables

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.1
    • Fix Version/s: 4.0.0
    • Component/s: Python
    • Environment: CentOS7, conda environment with pyarrow 1.0.1, numpy 1.19.1 and pandas 1.1.1

    Description

      pyarrow writes an invalid Feather v2 file, which it can't read afterwards.

          OSError: Verification of flatbuffer-encoded Footer failed.
      

      The following code reproduces the problem for me:

      import pyarrow as pa
      import pyarrow.feather as pf  # provides write_feather / read_feather used below
      import numpy as np
      import pandas as pd
      
      nbr_regions = 1223024
      nbr_motifs = 4891
      
      # Create (big) dataframe.
      df = pd.DataFrame(
          np.arange(nbr_regions * nbr_motifs, dtype=np.float32).reshape((nbr_regions, nbr_motifs)),
          index=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
          columns=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs')
      )
      
      # Transpose dataframe
      df_transposed = df.transpose()
      
      # Write transposed dataframe to Feather v2 format.
      pf.write_feather(df_transposed, 'df_transposed.feather')
      
      # Reading the transposed dataframe back from Feather v2 format results in this error:
      df_transposed_read = pf.read_feather('df_transposed.feather')
      
      ---------------------------------------------------------------------------
      OSError                                   Traceback (most recent call last)
      <ipython-input-64-b41ad5157e77> in <module>
      ----> 1 df_transposed_read = pf.read_feather('df_transposed.feather')
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
          213     """
          214     _check_pandas_version()
      --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
          216             .to_pandas(use_threads=use_threads))
          217
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
          235     """
          236     reader = ext.FeatherReader()
      --> 237     reader.open(source, use_memory_map=memory_map)
          238
          239     if columns is None:
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      OSError: Verification of flatbuffer-encoded Footer failed.
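
      For anyone trying to reproduce this, a quick size estimate helps, because the script allocates the whole array in memory. The sketch below uses only the dimensions and dtype from the script above and works out the element count, the approximate size of one copy of the data (about 22 GiB), and the shape after transposing. Transposing may need a second copy, which is not counted here.

```python
# Dimensions and dtype taken from the reproduction script above.
nbr_regions = 1223024
nbr_motifs = 4891
bytes_per_value = 4  # np.float32

n_values = nbr_regions * nbr_motifs
n_bytes = n_values * bytes_per_value

print(f"values: {n_values:,}")
print(f"size:   {n_bytes / 2**30:.1f} GiB per copy of the data")

# df has shape (nbr_regions, nbr_motifs). df.transpose() swaps the axes,
# so df_transposed has nbr_regions columns and nbr_motifs rows.
shape_df = (nbr_regions, nbr_motifs)
shape_transposed = shape_df[::-1]
print(f"df_transposed shape: {shape_transposed}")
```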
      

      Later I discovered that this also happens if the original dataframe is created directly in the transposed orientation:

      # Create (big) dataframe.
      df_without_transpose = pd.DataFrame(
          np.arange(nbr_motifs * nbr_regions, dtype=np.float32).reshape((nbr_motifs, nbr_regions)),
          index=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs'),
          columns=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
      )
      
      pf.write_feather(df_without_transpose, 'df_without_transpose.feather')
      
      df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
      
      ---------------------------------------------------------------------------
      OSError                                   Traceback (most recent call last)
      <ipython-input-91-3cdad1d58c35> in <module>
      ----> 1 df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
          213     """
          214     _check_pandas_version()
      --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
          216             .to_pandas(use_threads=use_threads))
          217
      
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
          235     """
          236     reader = ext.FeatherReader()
      --> 237     reader.open(source, use_memory_map=memory_map)
          238
          239     if columns is None:
      
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
      
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      OSError: Verification of flatbuffer-encoded Footer failed.
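
      Both failing frames have 1,223,024 columns, which fits the issue title: the FlatBuffers verifier caps the number of tables it will visit (max_tables), and the file footer holds the schema with at least one table per column. The sketch below is illustrative only. The limit of 1,000,000 is assumed to be the verifier default that applied here. The helper names and the one-row table are hypothetical and have not been run against pyarrow 1.0.1. It compares the column counts with that limit and shows how a much smaller reproduction might look.

```python
# Assumption: 1,000,000 is the default max_tables of the FlatBuffers C++ verifier.
FLATBUFFERS_DEFAULT_MAX_TABLES = 1_000_000

nbr_regions = 1223024  # columns in both failing dataframes
nbr_motifs = 4891      # rows in both failing dataframes


def certainly_exceeds_table_limit(n_columns, limit=FLATBUFFERS_DEFAULT_MAX_TABLES):
    """Return True if the column count alone is above the verifier limit.

    This counts one FlatBuffers table per column, which is only a lower
    bound, so False does not guarantee that verification succeeds.
    """
    return n_columns > limit


def roundtrip_wide_table(n_columns, path):
    """Write a one-row table with n_columns float32 columns and read it back.

    Hypothetical helper: pyarrow is imported lazily so the arithmetic above
    works without it. Building more than a million columns can be slow.
    """
    import pyarrow as pa
    import pyarrow.feather as pf

    names = [f"region{i}" for i in range(n_columns)]
    column = pa.array([0.0], type=pa.float32())
    table = pa.Table.from_arrays([column] * n_columns, names=names)
    pf.write_feather(table, path)
    return pf.read_table(path).num_columns


print(certainly_exceeds_table_limit(nbr_regions))
print(certainly_exceeds_table_limit(nbr_motifs))
```

      If the column count is indeed the trigger, calling roundtrip_wide_table(nbr_regions, 'wide.feather') (the file name is a placeholder) would be expected to fail with the same OSError on an affected version while needing only a few megabytes of data.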
      

      Writing to Feather v1 format works:

      pf.write_feather(df_transposed, 'df_transposed.v1.feather', version=1)
      
      df_transposed_read_v1 = pf.read_feather('df_transposed.v1.feather')
      
      # Now do the same, but also save the index in the Feather v1 file.
      df_transposed_reset_index = df_transposed.reset_index()
      
      pf.write_feather(df_transposed_reset_index, 'df_transposed_reset_index.v1.feather', version=1)
      
      df_transposed_reset_index_read_v1 = pf.read_feather('df_transposed_reset_index.v1.feather')
      
      # Returns True
      df_transposed_reset_index_read_v1.equals(df_transposed_reset_index)
      

            People

              Assignee: Gert Hulselmans (ghuls)
              Reporter: Gert Hulselmans (ghuls)
              Votes: 0
              Watchers: 5

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 3h 40m