[HDFS-6607] Improve DFSInputStream forward seek performance - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.1
Fix Version/s: None
Component/s: hdfs-client, performance
Labels: None

Description

When a DFSInputStream is open and a seek targets a position in the same block, the seek is efficient if the target position is already in the TCP buffer: the stream simply consumes the intervening data. See line 1368 in the file: hadoop-common/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java.

However, if the position is in the same block but beyond the TCP buffer, the input stream performs a series of actions: closing the current block reader, locating the block again, selecting a data node, and creating a new block reader. Many objects are created along the way, which makes this path very inefficient for users with random-access needs (e.g., index access).
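To make the two paths concrete, here is a minimal toy model in Java. It is not the actual DFSInputStream code: the class `ForwardSeekSketch`, its fields, and the fixed `bufferedWindow` (standing in for however many bytes happen to be in the TCP buffer) are all hypothetical names chosen for illustration. It only counts how often a seek can skip ahead cheaply versus how often it would have to tear down and recreate the block reader.

```java
/**
 * Toy model of the two seek paths described in this issue.
 * This is NOT the real DFSInputStream; all names here are hypothetical.
 */
final class ForwardSeekSketch {
    private final long blockLength;    // size of the single block being modeled
    private final long bufferedWindow; // assumption: bytes already sitting in the TCP buffer
    private long pos = 0;              // current stream position within the block
    private int readerCreations = 1;   // the stream starts with one open block reader
    private long bytesSkipped = 0;     // bytes consumed by fast-path seeks

    ForwardSeekSketch(long blockLength, long bufferedWindow) {
        this.blockLength = blockLength;
        this.bufferedWindow = bufferedWindow;
    }

    void seek(long target) {
        if (target < 0 || target >= blockLength) {
            throw new IllegalArgumentException("target outside the modeled block: " + target);
        }
        long diff = target - pos;
        if (diff >= 0 && diff <= bufferedWindow) {
            // Fast path: the target is ahead of us and within the buffered data,
            // so we just consume the intervening bytes.
            bytesSkipped += diff;
        } else {
            // Slow path: stands in for closing the block reader, locating the
            // block again, selecting a data node, and creating a new reader.
            readerCreations++;
        }
        pos = target;
    }

    long position() { return pos; }
    int readerCreations() { return readerCreations; }
    long bytesSkipped() { return bytesSkipped; }

    public static void main(String[] args) {
        ForwardSeekSketch in = new ForwardSeekSketch(1024, 64);
        in.seek(32);   // within the buffered window: fast path
        in.seek(500);  // same block, but past the window: slow path
        System.out.println("readers created: " + in.readerCreations());
        System.out.println("bytes skipped:   " + in.bytesSkipped());
    }
}
```

In this model, every forward seek that lands even one byte past the buffered window pays the full cost of a new reader, which is the behavior this issue asks to improve for index-style access patterns.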

I ran some experiments which showed that reading 3,000,000 records using seeks and reads is slower than reading 60,000,000 records, also using seeks and reads, which shows the need to improve the seek implementation.

Issue Links

relates to: HDFS-9146 HDFS forward seek() within a block shouldn't spawn new TCP Peer/RemoteBlockReader (Open)

People

Assignee: Unassigned

Reporter: Abdullah Alamoudi

Votes: 0

Watchers: 11

Dates

Created: 30/Jun/14 09:17

Updated: 25/Sep/15 19:41