Details
- Bug
- Status: Closed
- Blocker
- Resolution: Duplicate
- 0.10.0
- None
Description
I am using pyarrow's HdfsFilesystem interface. When I request a read of n bytes, 0%-300% more data than requested is often sent over the network. My suspicion is that pyarrow is reading ahead.
The pyarrow parquet reader doesn't have this behavior, and I am looking for a way to turn off read-ahead for the general HDFS interface.
I am running on Ubuntu 14.04 with Python 2.7. This issue is present in pyarrow 0.10 through 0.13 (the newest released version).
I have been using Wireshark to track the packets sent over the network.
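To make the 0%-300% figure concrete, the overhead can be computed from the number of bytes requested and the number of bytes seen in the capture. This is only a minimal sketch: `read_overhead_pct` is a hypothetical helper, and the byte counts are illustrative, not values from the actual captures.

```python
def read_overhead_pct(requested_bytes, observed_bytes):
    """Return the extra data that crossed the network, as a percentage of the request."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    return 100.0 * (observed_bytes - requested_bytes) / requested_bytes


# Illustrative numbers only: a 3,000,000-byte read that moved 12,000,000 bytes.
example_overhead = read_overhead_pct(3000000, 12000000)
```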
I suspect read-ahead because the first read takes much longer than the second read.
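The timing comparison can be reproduced with something like the sketch below. `time_consecutive_reads` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the HDFS file handle so the example runs without a cluster.

```python
import io
from timeit import default_timer


def time_consecutive_reads(f, n_bytes, n_reads=2):
    """Time back-to-back reads of n_bytes each from a file-like object.

    Returns a list of (bytes_returned, seconds_elapsed) tuples, one per read.
    """
    timings = []
    for _ in range(n_reads):
        start = default_timer()
        data = f.read(n_bytes)
        elapsed = default_timer() - start
        timings.append((len(data), elapsed))
    return timings


# Stand-in for an HDFS file handle (assumption: any object with .read works).
fake_file = io.BytesIO(b"x" * 10000)
results = time_consecutive_reads(fake_file, 4000, n_reads=3)
```

With a real connection, the object returned by `fs.open(file_path)` would be passed in place of `fake_file`. If read-ahead is happening, the first entry should show a noticeably larger elapsed time than the ones that follow.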
The regular pyarrow reader
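import pyarrow as pa

# hostname is the HDFS namenode host (not shown in this report)
fs = pa.hdfs.connect(hostname, driver='libhdfs')
file_path = 'dataset/train/piece0000'
f = fs.open(file_path)
f.seek(0)
n_bytes = 3000000
# Request 3,000,000 bytes; more than this is seen on the network
f.read(n_bytes)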
Parquet code without the same issue
import pyarrow.parquet as pq

parquet_path = 'dataset/train/parquet/part-22e3'
pf = fs.open(parquet_path)
pqf = pq.ParquetFile(pf)
# Read a single column from the first row group
data = pqf.read_row_group(0, columns=['col_name'])
Issue Links
- is duplicated by: ARROW-5432 [Python] Add 'read_at' method to pyarrow.NativeFile (Resolved)