[HDFS-347] DFS read performance suboptimal when client co-located on nodes with data - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.1.0-beta
Component/s: datanode, hdfs-client, performance
Labels: None
Target Version/s: 2.1.0-beta
Hadoop Flags: Reviewed

Description

One of the major strategies Hadoop uses to get scalable data processing is to move the code to the data. However, putting the DFS client on the same physical node as the data blocks it acts on doesn't improve read performance as much as expected.

After looking at Hadoop and O/S traces (via HADOOP-4049), I think the problem is due to the HDFS streaming protocol causing many more read I/O operations (iops) than necessary. Consider the case of a DFSClient fetching a 64-MB disk block from the DataNode process (running in a separate JVM) on the same machine. The DataNode will satisfy the single disk block request by sending data back to the HDFS client in 64-KB chunks. In BlockSender.java, this is done in the sendChunk() method, relying on Java's transferTo() method. Depending on the host O/S and JVM implementation, transferTo() is implemented as either a sendfilev() syscall or a pair of mmap() and write(). In either case, each chunk is read from the disk by issuing a separate I/O operation for each chunk. The result is that the single request for a 64-MB block ends up hitting the disk as over a thousand smaller requests for 64 KB each.
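To make the "over a thousand" figure concrete, here is a minimal arithmetic sketch. The class and method names (ChunkReadEstimate, chunksPerBlock) are made up for illustration and are not part of Hadoop; the sketch only assumes that each chunk costs one read operation, as described above.

```java
// Illustrative only: estimates how many chunk-sized reads cover one block,
// assuming one read I/O operation per chunk.
class ChunkReadEstimate {

    /** Number of chunk-sized reads needed to cover a block (ceiling division). */
    static long chunksPerBlock(long blockBytes, long chunkBytes) {
        if (blockBytes < 0 || chunkBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (blockBytes + chunkBytes - 1) / chunkBytes;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;   // 64-MB block
        long chunk = 64L * 1024;          // 64-KB chunk

        // 64 MB / 64 KB = 1024 separate reads for a single block request
        System.out.println(chunksPerBlock(block, chunk));            // prints 1024

        // With a hypothetical 4-MB read buffer the same block needs 16 reads
        System.out.println(chunksPerBlock(block, 4L * 1024 * 1024)); // prints 16
    }
}
```

The second call shows why a larger read buffer matters: the number of reads scales inversely with the buffer size.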

Since the DFSClient runs in a different JVM and process than the DataNode, shuttling data from the disk to the DFSClient also results in context switches each time network packets get sent (in this case, each 64-KB chunk turns into a large number of 1500-byte packet send operations). Thus we see a large number of context switches for each block send operation.
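A rough estimate of the packet count, as a sketch only. It assumes every packet carries a full 1500 bytes of payload and ignores protocol headers, so real counts would be somewhat higher; the names (PacketSendEstimate, packetsPerChunk, packetsPerBlock) are hypothetical.

```java
// Illustrative only: estimates packet send operations per chunk and per block,
// assuming 1500 bytes of payload per packet (headers ignored).
class PacketSendEstimate {

    /** Ceiling division for non-negative totals and positive unit sizes. */
    static long ceilDiv(long total, long unit) {
        if (total < 0 || unit <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (total + unit - 1) / unit;
    }

    /** Packets needed to send one chunk. */
    static long packetsPerChunk(long chunkBytes, long packetBytes) {
        return ceilDiv(chunkBytes, packetBytes);
    }

    /** Packets needed to send a whole block, chunk by chunk. */
    static long packetsPerBlock(long blockBytes, long chunkBytes, long packetBytes) {
        long fullChunks = blockBytes / chunkBytes;
        long lastChunkBytes = blockBytes % chunkBytes;   // 0 if the block divides evenly
        return fullChunks * packetsPerChunk(chunkBytes, packetBytes)
                + ceilDiv(lastChunkBytes, packetBytes);
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;  // 64-MB block
        long chunk = 64L * 1024;         // 64-KB chunk
        long packet = 1500L;             // assumed payload per packet

        System.out.println(packetsPerChunk(chunk, packet));        // prints 44
        System.out.println(packetsPerBlock(block, chunk, packet)); // prints 45056
    }
}
```

Under these assumptions a single block read involves tens of thousands of packet sends, each a potential context switch.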

I'd like to get some feedback on the best way to address this, but I think the answer is to provide a mechanism for a DFSClient to directly open data blocks that happen to be on the same machine. It could do this by examining the set of LocatedBlocks returned by the NameNode, marking those that should be resident on the local host. Since the DataNode and DFSClient (probably) share the same Hadoop configuration, the DFSClient should be able to find the files holding the block data, and it could directly open them and send data back to the client. This would avoid the context switches imposed by the network layer, and would allow for much larger read buffers than 64 KB, which should reduce the number of iops imposed by each read block operation.
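The "marking" step could look roughly like the sketch below. It deliberately does not use the real Hadoop classes: BlockLocation here is a hypothetical stand-in for the per-block replica host list that the NameNode returns, and the set of local host names is assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: picks out the blocks that have a replica on the local host.
class LocalBlockSelector {

    /** Hypothetical stand-in for a located block: an id plus its replica hosts. */
    static final class BlockLocation {
        final long blockId;
        final List<String> hosts;

        BlockLocation(long blockId, List<String> hosts) {
            this.blockId = blockId;
            this.hosts = hosts;
        }
    }

    /** Returns the ids of blocks with at least one replica on a local host name. */
    static List<Long> selectLocal(List<BlockLocation> blocks, Set<String> localHosts) {
        List<Long> local = new ArrayList<>();
        for (BlockLocation block : blocks) {
            for (String host : block.hosts) {
                if (localHosts.contains(host)) {
                    local.add(block.blockId);
                    break;  // one local replica is enough
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        List<BlockLocation> blocks = Arrays.asList(
                new BlockLocation(1L, Arrays.asList("node-a", "node-b")),
                new BlockLocation(2L, Arrays.asList("node-c", "node-d")),
                new BlockLocation(3L, Arrays.asList("node-b", "node-e")));
        Set<String> localHosts = new HashSet<>(Arrays.asList("node-b"));

        // Blocks 1 and 3 have a replica on node-b and could be opened directly
        System.out.println(selectLocal(blocks, localHosts));  // prints [1, 3]
    }
}
```

Blocks not selected would still go through the normal DataNode streaming path.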

Attachments


a.patch	04/Apr/13 23:33	413 kB	Tsz-wo Sze
2013-04-01-jenkins.patch	01/Apr/13 18:43	424 kB	Colin McCabe
2013.02.15.consolidated4.patch	16/Feb/13 02:12	366 kB	Colin McCabe
2013.01.31.consolidated2.patch	31/Jan/13 19:29	402 kB	Colin McCabe
2013.01.31.consolidated.patch	31/Jan/13 18:59	402 kB	Colin McCabe
2013.01.28.design.pdf	29/Jan/13 03:53	72 kB	Colin McCabe
full.patch	28/Jan/13 19:10	379 kB	Colin McCabe
hdfs-347-merge.txt	23/Jan/13 19:21	357 kB	Todd Lipcon
hdfs-347-merge.txt	15/Jan/13 00:44	349 kB	Todd Lipcon
hdfs-347-merge.txt	14/Jan/13 21:38	346 kB	Todd Lipcon
HDFS-347.035.patch	28/Dec/12 23:55	343 kB	Colin McCabe
HDFS-347.033.patch	19/Dec/12 09:29	321 kB	Colin McCabe
HDFS-347.030.patch	17/Dec/12 21:32	260 kB	Colin McCabe
HDFS-347.029.patch	16/Dec/12 00:32	259 kB	Colin McCabe
HDFS-347.027.patch	15/Dec/12 02:58	261 kB	Colin McCabe
HDFS-347.026.patch	14/Dec/12 02:44	240 kB	Colin McCabe
HDFS-347.025.patch	14/Nov/12 19:06	240 kB	Colin McCabe
HDFS-347.024.patch	14/Nov/12 00:57	240 kB	Colin McCabe
HDFS-347.022.patch	09/Nov/12 22:11	249 kB	Colin McCabe
HDFS-347.021.patch	06/Nov/12 19:58	248 kB	Colin McCabe
HDFS-347.020.patch	06/Nov/12 03:48	245 kB	Colin McCabe
HDFS-347.019.patch	10/Oct/12 20:03	247 kB	Colin McCabe
HDFS-347.018.patch2	08/Oct/12 18:20	246 kB	Colin McCabe
HDFS-347.018.clean.patch	03/Oct/12 22:03	112 kB	Colin McCabe
HDFS-347.017.patch	03/Oct/12 18:33	247 kB	Colin McCabe
HDFS-347.017.clean.patch	03/Oct/12 18:28	113 kB	Colin McCabe
HDFS-347.016.patch	01/Oct/12 20:59	250 kB	Colin McCabe
HDFS-347-016_cleaned.patch	01/Oct/12 20:59	109 kB	Colin McCabe
HDFS-347-branch-20-append.txt	09/Feb/11 23:58	27 kB	ryan rawson
BlockReaderLocal1.txt	08/Feb/11 00:17	28 kB	Dhruba Borthakur
hdfs-347.png	21/Dec/09 04:48	16 kB	Todd Lipcon
all.tsv	21/Dec/09 04:48	12 kB	Todd Lipcon
hdfs-347.txt	13/Oct/09 01:28	26 kB	Todd Lipcon
local-reads-doc	07/Oct/09 06:57	14 kB	Todd Lipcon
HADOOP-4801.3.patch	17/Dec/08 22:17	15 kB	George Porter
HADOOP-4801.2.patch	17/Dec/08 00:24	15 kB	George Porter
HADOOP-4801.1.patch	13/Dec/08 03:51	14 kB	George Porter

Issue Links

blocks

HBASE-8337 Investigate why disabling hadoop short circuit read is required to make recovery tests pass consistently under hadoop2

Closed

contains

HADOOP-9983 SocketInputStream class (org.apache.hadoop.net.SocketInputStream) is not public class in 2.0.5-alpha

Resolved

depends upon

HADOOP-6311 Add support for unix domain sockets to JNI libs

Resolved

is depended upon by

ACCUMULO-884 Insight into short circuit read for local files

Resolved

is related to

HDFS-1599 Umbrella Jira for Improving HBASE support in HDFS

Open

HDFS-6699 Secure Windows DFS read when client co-located on nodes with data (short-circuit reads)

Open

HBASE-3529 Add search to HBase

Closed

HDFS-2246 Shortcut a local client reads to a Datanodes files directly

Closed

relates to

HDFS-4284 BlockReaderLocal not notified of failed disks

Open

HADOOP-3205 Read multiple chunks directly from FSInputChecker subclass into user buffers

Closed

HDFS-385 Design a pluggable interface to place replicas of blocks in HDFS

Closed

Sub-Tasks

1.	Encapsulate arguments to BlockReaderFactory in a class	Resolved	Colin McCabe
2.	Encapsulate connections to peers in Peer and PeerServer classes	Resolved	Colin McCabe
3.	Create DomainSocket and DomainPeer and associated unit tests	Resolved	Colin McCabe
4.	BlockReaderLocal should use passed file descriptors rather than paths	Resolved	Colin McCabe
5.	DomainSocket should throw AsynchronousCloseException when appropriate	Resolved	Colin McCabe
6.	Bypass UNIX domain socket unit tests when they cannot be run	Resolved	Colin McCabe
7.	DFSInputStream#getBlockReader: last retries should ignore the cache	Resolved	Colin McCabe
8.	Fix bug in DomainSocket path validation	Resolved	Colin McCabe
9.	some small DomainSocket fixes: avoid findbugs warning, change log level, etc.	Resolved	Colin McCabe
10.	change dfs.datanode.domain.socket.path to dfs.domain.socket.path	Resolved	Colin McCabe
11.	HDFS-347: fix case where local reads get disabled incorrectly	Resolved	Colin McCabe
12.	HDFS-347: increase default FileInputStreamCache size	Resolved	Todd Lipcon
13.	make TestPeerCache not flaky	Resolved	Colin McCabe
14.	TestDomainSocket fails when system umask is set to 0002	Resolved	Colin McCabe
15.	avoid annoying log message when dfs.domain.socket.path is not set	Resolved	Colin McCabe
16.	Make a simple doc to describe the usage and design of the shortcircuit read feature	Resolved	Colin McCabe
17.	DataNode: don't create domain socket unless we need it	Resolved	Colin McCabe
18.	HDFS-347: style cleanups	Resolved	Colin McCabe
19.	HDFS-347: DN should chmod socket path a+w	Closed	Colin McCabe
20.	DFSClient: don't create a domain socket unless we need it	Resolved	Colin McCabe
21.	allow use of legacy blockreader	Resolved	Colin McCabe
22.	fix various bugs in short circuit read	Closed	Colin McCabe

People

Assignee: Colin McCabe

Reporter: George Porter

Votes: 12

Watchers: 111

Dates

Created: 07/Dec/08 19:34

Updated: 30/Nov/16 03:30

Resolved: 16/May/13 07:11