[ARROW-5318] [Python] pyarrow hdfs reader overrequests - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Duplicate
Affects Version/s: 0.10.0
Fix Version/s: 0.14.0
Component/s: Python
Labels: None
External issue URL: https://github.com/apache/arrow/issues/21780

Description

I am using pyarrow's HdfsFilesystem interface. When I request a read of n bytes, anywhere from 0% to 300% more data than requested is often sent over the network. I suspect that pyarrow is reading ahead.
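To make the 0% to 300% figure concrete, here is a minimal sketch of how the overrequest could be computed from the number of bytes requested and the number of bytes observed on the wire. The helper name `overrequest_percent` and the sample numbers are hypothetical, chosen only for illustration; they are not measurements from this report.

```python
def overrequest_percent(requested_bytes, observed_bytes):
    """Return how much more data was transferred than requested, in percent."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    return 100.0 * (observed_bytes - requested_bytes) / requested_bytes


# Hypothetical numbers: a 3,000,000-byte read that moved 12,000,000 bytes.
print(overrequest_percent(3000000, 12000000))  # 300.0
```

A result of 0.0 means exactly the requested amount was transferred; 300.0 means four times the requested amount crossed the network.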

The pyarrow Parquet reader does not show this behavior, and I am looking for a way to turn off read-ahead for the general HDFS interface.

I am running Ubuntu 14.04 with Python 2.7. The issue is present in pyarrow 0.10 through 0.13 (the newest released version).

I have been using Wireshark to track the packets sent over the network.

I suspect read-ahead because the first read takes much longer than the second read.
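The timing comparison can be sketched roughly as follows. The helper `time_reads` is a hypothetical name, and an in-memory `io.BytesIO` stands in for the HDFS file handle so the snippet runs without a cluster; with a real handle from `fs.open(...)`, a first read that is much slower than the second would be consistent with read-ahead, though it would not prove it.

```python
import io
from timeit import default_timer


def time_reads(f, n_bytes, repeats=2):
    """Time consecutive reads of n_bytes from f.

    Returns a list of (elapsed_seconds, bytes_returned), one entry per read.
    """
    results = []
    for _ in range(repeats):
        start = default_timer()
        data = f.read(n_bytes)
        results.append((default_timer() - start, len(data)))
    return results


# Stand-in for the HDFS handle (assumption: any object with read() works).
f = io.BytesIO(b"\0" * 8000000)
for i, (elapsed, n) in enumerate(time_reads(f, 3000000), 1):
    print("read %d: %d bytes in %.6f s" % (i, n, elapsed))
```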

The regular pyarrow reader

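import pyarrow as pa

# hostname: placeholder for the HDFS namenode host
fs = pa.hdfs.connect(hostname, driver='libhdfs')
file_path = 'dataset/train/piece0000'
f = fs.open(file_path)
f.seek(0)
n_bytes = 3000000
f.read(n_bytes)  # more than n_bytes is often sent over the network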

Parquet code without the same issue

import pyarrow.parquet as pq

# fs is the HDFS connection opened in the previous snippet
parquet_path = 'dataset/train/parquet/part-22e3'
pf = fs.open(parquet_path)
pqf = pq.ParquetFile(pf)
data = pqf.read_row_group(0, columns=['col_name'])

Issue Links

is duplicated by: ARROW-5432 [Python] Add 'read_at' method to pyarrow.NativeFile (Resolved)

People

Assignee: Unassigned
Reporter: Ivan Dimitrov
Votes: 0
Watchers: 3

Dates

Created: 14/May/19 19:08
Updated: 11/Jan/23 07:39
Resolved: 27/Jun/19 06:36