Details
Bug
Status: Resolved
Major
Resolution: Fixed
0.16.0
None
None
Description
On the current master branch:
$ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet
...
lseek(3, -8, SEEK_END) = 2937
read(3, ",\10\0\0PAR1", 8192) = 8
lseek(3, 845, SEEK_SET) = 845
read(3, "\25\2\31\334H schema"..., 8192) = 2100
...
lseek(5, 4, SEEK_SET) = 4
read(5, "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000"..., 8192) = 2941
lseek(5, 5, SEEK_SET) = 5
read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000"..., 8192) = 2940
lseek(5, 6, SEEK_SET) = 6
read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000"..., 8192) = 2939
lseek(5, 7, SEEK_SET) = 7
read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000000"..., 8192) = 2938
lseek(5, 8, SEEK_SET) = 8
read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000000"..., 8192) = 2937
lseek(5, 9, SEEK_SET) = 9
read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004"..., 8192) = 2936
lseek(5, 10, SEEK_SET) = 10
read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004\30"..., 8192) = 2935
Notice that the seek position advances by only one byte between reads, even though each read requests up to 8192 bytes. Interestingly, this does not seem to have a big performance impact on a local file system on Linux, but it becomes a problem when working with a custom implementation of ParquetReader, for example one that reads from S3.
The problem seems to be in
impl<R: ParquetReader> Read for FileSource<R>
which is unconditionally calling
reader.seek(SeekFrom::Start(self.start as u64))?
Instead, it should probably keep track of the current position and only seek on the first read.
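The suggestion above could be sketched roughly as follows. This is only an illustration, not the actual parquet crate code: `TrackedSource`, `SeekCounter`, and `demo` are hypothetical names, and the sketch assumes the source owns its reader exclusively, so nothing else moves the cursor between reads.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Hypothetical stand-in for `FileSource`: exposes the byte range
/// `start..start + length` of a reader that it owns exclusively.
struct TrackedSource<R: Read + Seek> {
    reader: R,
    pos: u64,         // absolute offset of the next byte to read
    end: u64,         // exclusive end of the range
    positioned: bool, // true once the initial seek has been issued
}

impl<R: Read + Seek> TrackedSource<R> {
    fn new(reader: R, start: u64, length: u64) -> Self {
        TrackedSource { reader, pos: start, end: start + length, positioned: false }
    }
}

impl<R: Read + Seek> Read for TrackedSource<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Seek only once; later reads continue from where the last one ended.
        if !self.positioned {
            self.reader.seek(SeekFrom::Start(self.pos))?;
            self.positioned = true;
        }
        // Never read past the end of the range.
        let remaining = self.end.saturating_sub(self.pos);
        let max = std::cmp::min(buf.len() as u64, remaining) as usize;
        let n = self.reader.read(&mut buf[..max])?;
        self.pos += n as u64;
        Ok(n)
    }
}

/// Helper for the demonstration: counts how often `seek` is called.
struct SeekCounter<R> {
    inner: R,
    seeks: usize,
}

impl<R: Read> Read for SeekCounter<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }
}

impl<R: Seek> Seek for SeekCounter<R> {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        self.seeks += 1;
        self.inner.seek(target)
    }
}

/// Reads bytes 4..14 of a 32-byte buffer in 3-byte chunks and returns
/// the bytes read together with the number of seeks that were issued.
fn demo() -> io::Result<(Vec<u8>, usize)> {
    let data: Vec<u8> = (0u8..32).collect();
    let counter = SeekCounter { inner: Cursor::new(data), seeks: 0 };
    let mut source = TrackedSource::new(counter, 4, 10);
    let mut out = Vec::new();
    let mut chunk = [0u8; 3];
    loop {
        let n = source.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        out.extend_from_slice(&chunk[..n]);
    }
    Ok((out, source.reader.seeks))
}
```

Calling `demo()` reads the range over several small reads while issuing a single seek. If the underlying reader can be shared with other sources, the tracked position alone is not enough, and the source would need to compare it against the reader's actual position before deciding whether to seek.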
Issue Links
- is fixed by ARROW-7681 [Rust] Explicitly seeking a BufReader will discard the internal buffer (Resolved)