[HDFS-223] Asynchronous IO Handling in Hadoop and HDFS - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

I think Hadoop needs utilities or a framework that make generic asynchronous IO simpler to deal with.

Example use case:

It has been a long-standing problem that the DataNode uses too many threads for data transfers. Each write operation takes up 2 threads at each of the datanodes, and each read operation takes one, irrespective of how much activity there is on the sockets. The kinds of load that HDFS serves have been expanding quite fast, and HDFS should handle these varied loads better. With a framework for non-blocking IO, the read and write pipeline state machines could be implemented with async events on a fixed number of threads.

A generic utility is better since it could also be used in other places such as DFSClient, which currently creates 2 extra threads for each file it has open for writing.

Initially I started writing a primitive "selector", then looked to see whether such a facility already exists. Apache MINA seemed to do exactly this. My impression after looking at the interface and examples is that it does not give the kind of control we might prefer or need. The first use case I was thinking of implementing with MINA was replacing the "response handlers" in DataNode. The response handlers are simpler since they don't involve disk I/O. I asked on the MINA user list, but it looks like it cannot be done, I think mainly because the sockets are already created.

Essentially, what I have in mind is similar to MINA, except that reading from and writing to the sockets is done by the event handlers. The lowest layer runs the selectors and invokes event handlers on a single thread or on multiple threads. Each event handler is expected to do some non-blocking work. We would of course have utility handler implementations that do read, write, accept, etc., which are useful for simple processing.
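To make the idea concrete, here is a rough sketch of that layering using plain java.nio. Every name in it (EventHandler, EventLoop, AcceptHandler, EchoHandler, EchoSketch) is made up for illustration and is not an existing Hadoop or MINA API. It assumes a single selector thread; the lowest layer only selects and dispatches, and the handlers do the actual socket reads and writes on channels that were created elsewhere.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** Hypothetical handler contract: do a bounded amount of non-blocking work per event. */
interface EventHandler {
    void handle(SelectionKey key) throws IOException;
}

/** Lowest layer: runs one selector and hands each ready key to its attached handler. */
final class EventLoop implements Runnable {
    private final Selector selector;
    private volatile boolean running = true;

    EventLoop() throws IOException { selector = Selector.open(); }

    /** Accepts an already-created channel. Call before the loop starts or from a handler. */
    SelectionKey register(SelectableChannel ch, int ops, EventHandler h) throws IOException {
        ch.configureBlocking(false);
        return ch.register(selector, ops, h);
    }

    void stop() { running = false; selector.wakeup(); }

    @Override public void run() {
        try {
            while (running) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (!key.isValid()) continue;
                    try {
                        ((EventHandler) key.attachment()).handle(key);
                    } catch (IOException e) {
                        closeQuietly(key.channel()); // a failing handler loses only its own channel
                    }
                }
            }
            for (SelectionKey key : selector.keys()) closeQuietly(key.channel());
            selector.close();
        } catch (IOException e) {
            throw new java.io.UncheckedIOException(e);
        }
    }

    private static void closeQuietly(SelectableChannel ch) {
        try { ch.close(); } catch (IOException ignored) { }
    }
}

/** Utility handler: accepts a connection and registers an echo handler for it. */
final class AcceptHandler implements EventHandler {
    private final EventLoop loop;
    AcceptHandler(EventLoop loop) { this.loop = loop; }

    @Override public void handle(SelectionKey key) throws IOException {
        SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
        if (client != null) loop.register(client, SelectionKey.OP_READ, new EchoHandler());
    }
}

/** Utility handler: reads what is available and writes it back without blocking. */
final class EchoHandler implements EventHandler {
    private final ByteBuffer buf = ByteBuffer.allocate(4096);
    private boolean eof;

    @Override public void handle(SelectionKey key) throws IOException {
        SocketChannel ch = (SocketChannel) key.channel();
        if (!eof && key.isReadable()) eof = ch.read(buf) < 0;
        buf.flip();
        ch.write(buf);                       // may write only part of the buffer
        boolean pending = buf.hasRemaining();
        buf.compact();
        if (eof && !pending) { ch.close(); return; }
        int ops = pending ? SelectionKey.OP_WRITE : 0;
        if (!eof && buf.hasRemaining()) ops |= SelectionKey.OP_READ;
        key.interestOps(ops);
    }
}

final class EchoSketch {
    /** Starts the loop on one thread, sends msg from a blocking client, returns the echo. */
    static String roundTrip(String msg) throws Exception {
        EventLoop loop = new EventLoop();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        loop.register(server, SelectionKey.OP_ACCEPT, new AcceptHandler(loop));
        Thread t = new Thread(loop, "event-loop");
        t.start();
        byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
        ByteBuffer reply = ByteBuffer.allocate(bytes.length);
        try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
            client.write(ByteBuffer.wrap(bytes));
            while (reply.hasRemaining() && client.read(reply) >= 0) { }
        }
        loop.stop();
        t.join();
        return new String(reply.array(), StandardCharsets.UTF_8);
    }
}
```

Calling EchoSketch.roundTrip("ping") exercises the accept and echo handlers on the one loop thread. The point of the shape is that register() takes a channel someone else already opened, which is the piece that appeared to be missing for the DataNode response handler case. Dispatching to a pool of threads, or handlers that touch disk, would need more care than this sketch shows.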

Sam Pullara mentioned that xSockets is more flexible. It is under GPL.

Are there other such implementations we should look at?

Attachments

- MinaEchoServer.patch (43 kB), 06/Aug/08 18:58, Raghu Angadi
- GrizzlyEchoServer.patch (5 kB), 07/Aug/08 22:54, Raghu Angadi

Issue Links

is related to

- HDFS-249 RPC support for large data transfers. (Open)
- HDFS-1599 Umbrella Jira for Improving HBASE support in HDFS (Open)
- HBASE-14790 Implement a new DFSOutputStream for logging WAL only (Closed)

relates to

- HADOOP-3859 1000 concurrent read on a single file failing the task/client (Closed)
- HDFS-916 Rewrite DFSOutputStream to use a single thread with NIO (Open)

People

Assignee: Unassigned

Reporter: Raghu Angadi

Votes: 7

Watchers: 63

Dates

Created: 29/Jul/08 23:30

Updated: 11/May/16 02:04