Apache Arrow / ARROW-10056

[C++] Increase flatbuffers max_tables parameter in order to read wide tables

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.1
    • Fix Version/s: 4.0.0
    • Component/s: Python
    • Environment: CentOS7,
      conda environment with pyarrow 1.0.1, numpy 1.19.1 and pandas 1.1.1

    Description

      pyarrow writes an invalid Feather v2 file, which it can't read afterwards.

          OSError: Verification of flatbuffer-encoded Footer failed.
      

      The following code reproduces the problem for me:

      import pyarrow as pa
      import pyarrow.feather as pf  # provides write_feather / read_feather used below
      import numpy as np
      import pandas as pd
      
      nbr_regions = 1223024
      nbr_motifs = 4891
      
      # Create (big) dataframe.
      df = pd.DataFrame(
          np.arange(nbr_regions * nbr_motifs, dtype=np.float32).reshape((nbr_regions, nbr_motifs)),
          index=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
          columns=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs')
      )
      
      # Transpose dataframe
      df_transposed = df.transpose()
      
      # Write transposed dataframe to Feather v2 format.
      pf.write_feather(df_transposed, 'df_transposed.feather')
      
      # Reading the transposed dataframe back from Feather v2 format raises this error:
      df_transposed_read = pf.read_feather('df_transposed.feather')
      
      ---------------------------------------------------------------------------
      OSError                                   Traceback (most recent call last)
      <ipython-input-64-b41ad5157e77> in <module>
      ----> 1 df_transposed_read = pf.read_feather('df_transposed.feather')
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
          213     """
          214     _check_pandas_version()
      --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
          216             .to_pandas(use_threads=use_threads))
          217
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
          235     """
          236     reader = ext.FeatherReader()
      --> 237     reader.open(source, use_memory_map=memory_map)
          238
          239     if columns is None:
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      OSError: Verification of flatbuffer-encoded Footer failed.
      

      Later I discovered that it also happens if the original dataframe is created directly in the transposed orientation:

      # Create (big) dataframe.
      df_without_transpose = pd.DataFrame(
          np.arange(nbr_motifs * nbr_regions, dtype=np.float32).reshape((nbr_motifs, nbr_regions)),
          index=pd.Index(['motif' + str(i) for i in range(nbr_motifs)], name='motifs'),
          columns=pd.Index(['region' + str(i) for i in range(nbr_regions)], name='regions'),
      )
      
      pf.write_feather(df_without_transpose, 'df_without_transpose.feather')
      
      df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
      
      ---------------------------------------------------------------------------
      OSError                                   Traceback (most recent call last)
      <ipython-input-91-3cdad1d58c35> in <module>
      ----> 1 df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_feather(source, columns, use_threads, memory_map)
          213     """
          214     _check_pandas_version()
      --> 215     return (read_table(source, columns=columns, memory_map=memory_map)
          216             .to_pandas(use_threads=use_threads))
          217
      
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py in read_table(source, columns, memory_map)
          235     """
          236     reader = ext.FeatherReader()
      --> 237     reader.open(source, use_memory_map=memory_map)
          238
          239     if columns is None:
      
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.open()
      
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
      
      
      /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
      
      OSError: Verification of flatbuffer-encoded Footer failed.
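
What the two failing cases have in common is their width: both frames have 1,223,024 columns and 4,891 rows, and every column adds at least one Field table to the flatbuffer-encoded schema in the file footer. The arithmetic below is a rough sketch of that observation, not a description of pyarrow internals. `ASSUMED_MAX_TABLES` is an assumption (the FlatBuffers verifier's default table budget as I recall it, not a value stated in this report), and `describe` is a hypothetical helper introduced only for illustration.

```python
# Dimensions taken from the reproduction in this report.
nbr_regions = 1223024
nbr_motifs = 4891

# Assumption: default table budget of the FlatBuffers verifier.
# This value is not stated in the report.
ASSUMED_MAX_TABLES = 1_000_000


def describe(n_rows, n_cols, itemsize=4):
    """Summarize a float32 frame: column count, raw data size, and
    whether the column count alone exceeds the assumed table budget."""
    return {
        "columns": n_cols,
        "data_gib": round(n_rows * n_cols * itemsize / 2**30, 2),
        "exceeds_assumed_budget": n_cols > ASSUMED_MAX_TABLES,
    }


# Original orientation: regions as rows, motifs as columns.
original = describe(nbr_regions, nbr_motifs)
# Transposed orientation (both failing cases): motifs as rows, regions as columns.
transposed = describe(nbr_motifs, nbr_regions)

print("original:  ", original)
print("transposed:", transposed)
```

Both orientations hold the same amount of data, so the full reproduction needs a machine with well over 20 GiB of RAM; only the wide orientation crosses the assumed budget.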
      

      Writing to Feather v1 format works:

      pf.write_feather(df_transposed, 'df_transposed.v1.feather', version=1)
      
      df_transposed_read_v1 = pf.read_feather('df_transposed.v1.feather')
      
      # Now do the same, but also save the index in the Feather v1 file.
      df_transposed_reset_index = df_transposed.reset_index()
      
      pf.write_feather(df_transposed_reset_index, 'df_transposed_reset_index.v1.feather', version=1)
      
      df_transposed_reset_index_read_v1 = pf.read_feather('df_transposed_reset_index.v1.feather')
      
      # Returns True
      df_transposed_reset_index_read_v1.equals(df_transposed_reset_index)
      

          People

            Assignee: Gert Hulselmans (ghuls)
            Reporter: Gert Hulselmans (ghuls)
            Votes: 0
            Watchers: 4

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 3h 40m
