Apache Arrow / ARROW-3098

[Python] BufferReader doesn't adhere to the seek protocol

      Description

      I have a script that creates a Parquet file, writes it to a BufferOutputStream, and then wraps the result in a BufferReader so that it can be passed to code that takes a file-like object and uploads it somewhere else. That code relies on being able to seek to the end of the file to figure out how big the file is, e.g.:

      reader.seek(0, 2)
      size = reader.tell()
      reader.seek(0)
      

       

      But when I do that, the following exception is raised:

       

      pyarrow/io.pxi:209: in pyarrow.lib.NativeFile.seek
      ???
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
      
      > ???
      E pyarrow.lib.ArrowIOError: position out of bounds
      

      For comparison, copying the same bytes into an io.BytesIO works. The first test below fails with the exception above; the second passes:

      import io
      
      import pyarrow as pa
      
      
      def test_arrow_output_stream():
          output = pa.BufferOutputStream()
          output.write(b'hello')
      
          reader = pa.BufferReader(output.getvalue())
      
          # whence=2 means "relative to the end"; this raises
          # ArrowIOError: position out of bounds
          reader.seek(0, 2)
          assert reader.tell() == 5
      
      
      def test_python_io_stream():
          output = pa.BufferOutputStream()
          output.write(b'hello')
      
          buffer = io.BytesIO(output.getvalue().to_pybytes())
          reader = io.BufferedRandom(buffer)
      
          # The same seek on a standard-library stream succeeds
          reader.seek(0, 2)
          assert reader.tell() == 5
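
For reference, the pattern the consuming code relies on can be sketched with only the standard library. This is a minimal illustration, not the actual upload code: `stream_size` is a hypothetical helper name introduced here, and the example runs against io.BytesIO, which follows the seek protocol. A BufferReader would be expected to behave the same way once seek(0, 2) is supported.

```python
import io
import os


def stream_size(stream):
    """Return the total size of a seekable stream, restoring its position.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    start = stream.tell()             # remember where the caller was
    stream.seek(0, os.SEEK_END)       # whence=2: offset relative to the end
    size = stream.tell()              # position at the end equals the size
    stream.seek(start, os.SEEK_SET)   # whence=0: back to the original position
    return size


# Demonstrated on io.BytesIO, which implements the protocol.
stream = io.BytesIO(b'hello')
stream.read(2)                        # move away from the start
assert stream_size(stream) == 5
assert stream.tell() == 2             # position is left untouched
```

Until the seek is fixed, a caller that only needs the size could probably avoid seeking altogether, for example by asking the underlying buffer for its length; whether that is possible depends on what the consuming code accepts.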
      

              People

              • Assignee: Antoine Pitrou (pitrou)
              • Reporter: Björn Andersson (gaqzi)