  1. Apache Arrow
  2. ARROW-7574

[Rust] FileSource read implementation is seeking for each single byte


Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.16.0
    • None
    • Rust
    • None

    Description

      On the current master branch:

      $ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet
      ...
      lseek(3, -8, SEEK_END)                  = 2937
      read(3, ",\10\0\0PAR1", 8192)           = 8
      lseek(3, 845, SEEK_SET)                 = 845
      read(3, "\25\2\31\334H schema"..., 8192) = 2100
      ...
      lseek(5, 4, SEEK_SET)                   = 4
      read(5, "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000"..., 8192) = 2941
      lseek(5, 5, SEEK_SET)                   = 5
      read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000"..., 8192) = 2940
      lseek(5, 6, SEEK_SET)                   = 6
      read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000"..., 8192) = 2939
      lseek(5, 7, SEEK_SET)                   = 7
      read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000000"..., 8192) = 2938
      lseek(5, 8, SEEK_SET)                   = 8
      read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000000"..., 8192) = 2937
      lseek(5, 9, SEEK_SET)                   = 9
      read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004"..., 8192) = 2936
      lseek(5, 10, SEEK_SET)                  = 10
      read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004\30"..., 8192) = 2935
      

      Notice that the seek position advances by only one byte per call, even though each read requests up to 8192 bytes. Interestingly, this does not seem to have a big performance impact on a local file system on Linux, but it becomes a problem when working with a custom implementation of ParquetReader, for example one that reads from S3.
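
The access pattern in the trace can also be observed without strace. The following is a minimal sketch, not code from the parquet crate: `CountingReader` and `read_bytewise_with_seek` are hypothetical names, and the helper only imitates the seek-before-every-read pattern seen above. Wrapping a reader like this shows how many calls would reach a remote backend such as S3.

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// Hypothetical wrapper that counts the calls reaching the underlying reader.
struct CountingReader<R> {
    inner: R,
    seeks: usize,
    reads: usize,
}

impl<R> CountingReader<R> {
    fn new(inner: R) -> Self {
        CountingReader { inner, seeks: 0, reads: 0 }
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        self.reads += 1;
        self.inner.read(buf)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> Result<u64> {
        self.seeks += 1;
        self.inner.seek(pos)
    }
}

/// Imitates the pattern in the trace (an assumption about the caller):
/// seek to the current offset before every read, one byte per call.
fn read_bytewise_with_seek<R: Read + Seek>(
    reader: &mut R,
    start: u64,
    len: usize,
) -> Result<Vec<u8>> {
    let mut out = Vec::with_capacity(len);
    let mut pos = start;
    let mut byte = [0u8; 1];
    while out.len() < len {
        reader.seek(SeekFrom::Start(pos))?;
        if reader.read(&mut byte)? == 0 {
            break; // end of input
        }
        out.push(byte[0]);
        pos += 1;
    }
    Ok(out)
}
```

With this pattern, every byte consumed costs one seek and one read on the underlying reader. For a reader backed by network requests, each of those calls may become a round trip.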

      The problem seems to be in

      impl<R: ParquetReader> Read for FileSource<R>
      

      which unconditionally calls

      reader.seek(SeekFrom::Start(self.start as u64))?
      

      Instead, it should probably keep track of the current position and seek only on the first read.
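
One way the suggestion could look is sketched below. This is not the actual fix that was merged: `TrackingSource` is a hypothetical stand-in for `FileSource`, the `seeks` counter exists only for demonstration, and the sketch assumes the source has exclusive use of its reader, so that nothing else moves the file offset between calls.

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// Hypothetical stand-in for `FileSource`: a view of the bytes
/// `start..end` of `reader` that remembers where the reader is positioned.
struct TrackingSource<R> {
    reader: R,
    start: u64,              // next offset this source will read from
    end: u64,                // exclusive end of the view
    reader_pos: Option<u64>, // last known reader position, None if unknown
    seeks: usize,            // seeks issued so far (demonstration only)
}

impl<R: Read + Seek> TrackingSource<R> {
    fn new(reader: R, start: u64, length: u64) -> Self {
        TrackingSource {
            reader,
            start,
            end: start + length,
            reader_pos: None,
            seeks: 0,
        }
    }
}

impl<R: Read + Seek> Read for TrackingSource<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        // Never read past the end of the view.
        let n = (self.end - self.start).min(buf.len() as u64) as usize;
        if n == 0 {
            return Ok(0);
        }
        // Seek only when the reader is not already at the wanted offset.
        if self.reader_pos != Some(self.start) {
            self.reader.seek(SeekFrom::Start(self.start))?;
            self.seeks += 1;
        }
        // If the read fails, the position is unknown and the next call seeks again.
        self.reader_pos = None;
        let read = self.reader.read(&mut buf[..n])?;
        self.start += read as u64;
        self.reader_pos = Some(self.start);
        Ok(read)
    }
}
```

Reading such a source one byte at a time issues a single seek on the first call and none afterwards. If several sources share one file offset (for example through duplicated file descriptors), the tracked position can become stale, and a different strategy would be needed.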

            People

              Assignee: Unassigned
              Reporter: Jörn Horstmann (jhorstmann)