HBASE-10566

cleanup rpcTimeout in the client


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.99.0
    • Fix Version/s: 0.99.0
    • Component/s: Client
    • Labels: None
    • Hadoop Flags: Reviewed
      3 new settings are now available to configure the socket in the HBase client:
      - connect timeout: "hbase.ipc.client.socket.timeout.connect" (milliseconds, default: 10 seconds)
      - read timeout: "hbase.ipc.client.socket.timeout.read" (milliseconds, default: 20 seconds)
      - write timeout: "hbase.ipc.client.socket.timeout.write" (milliseconds, default: 60 seconds)

      ipc.socket.timeout is not used anymore.
      The per-operation timeout is still controlled by hbase.rpc.timeout.
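For reference, the three settings from the release note go in the client-side hbase-site.xml; the values below are the stated defaults, in milliseconds:

```xml
<configuration>
  <!-- Fail if the TCP connection cannot be established in time (default: 10 s). -->
  <property>
    <name>hbase.ipc.client.socket.timeout.connect</name>
    <value>10000</value>
  </property>
  <!-- Fail if a socket read blocks for longer than this (default: 20 s). -->
  <property>
    <name>hbase.ipc.client.socket.timeout.read</name>
    <value>20000</value>
  </property>
  <!-- Fail if a socket write blocks for longer than this (default: 60 s). -->
  <property>
    <name>hbase.ipc.client.socket.timeout.write</name>
    <value>60000</value>
  </property>
</configuration>
```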

    Description

      There are two issues:
      1) A confusion between the socket timeout and the call timeout.
      Socket timeouts should be minimal: a default like 20 seconds, which could be lowered to single-digit timeouts for some applications: if we cannot write to the socket in 10 seconds, we have an issue. This is different from the total duration of the operation (send query + execute query + receive response), which can be longer, as it can include remote calls made by the server and so on. Today we have a single value, which does not allow low socket read timeouts.
      2) The timeout can differ between calls. Typically, if the total time, retries included, is 60 seconds and the first attempt failed after 2 seconds, the remaining budget is 58 seconds. HBase does this today, but by hacking with a thread-local variable. It's a hack (the timeout should have been a parameter of the methods; the thread-local allowed bypassing all the layers. Maybe protobuf makes this complicated, to be confirmed), but it also does not really work, because we can have multithreading issues (we use the updated rpc timeout of another thread, or we create a new BlockingRpcChannelImplementation with a random default timeout).
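The remaining-budget arithmetic in point 2) can be sketched as follows; this is an illustration with hypothetical names, not the actual HBase client code. The point is that the budget becomes an explicit parameter of each attempt instead of a thread-local:

```java
// Sketch only: per-call timeout passed explicitly rather than through a
// thread-local. Names are hypothetical, not taken from the HBase client.
public class CallTimeouts {
    // Budget left for the next attempt: the total operation timeout minus
    // the time already spent on previous attempts, floored at zero.
    static long remainingTimeoutMs(long operationTimeoutMs, long elapsedMs) {
        return Math.max(0L, operationTimeoutMs - elapsedMs);
    }

    public static void main(String[] args) {
        // 60 s total budget, first attempt failed after 2 s -> 58 s left.
        System.out.println(remainingTimeoutMs(60_000, 2_000)); // prints 58000
    }
}
```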

      Ideally, we would send the call timeout to the server as well: the server could then discard on its own the calls that it received but that got stuck in the request queue or in internal retries (on HDFS, for example).

      This would make the system more reactive to failures.
      I think we can solve this now, especially after HBASE-10525. The main issue is to find something that fits well with protobuf...
      It should then be easy to have a pool of threads for writers and readers, instead of a single thread per region server as today.
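If the call timeout were shipped with the request, the server-side check could be as simple as this sketch (hypothetical names; this is not the code from the attached patches):

```java
// Sketch only: a server that knows the client's call timeout can drop
// calls that expired while they sat in the request queue.
public class CallDeadline {
    // True if the call has already outlived its client-side timeout.
    static boolean expired(long receivedAtMs, long callTimeoutMs, long nowMs) {
        return nowMs - receivedAtMs >= callTimeoutMs;
    }

    public static void main(String[] args) {
        // Queued at t=0 with a 2 s timeout, dequeued at t=3 s: drop it.
        System.out.println(expired(0, 2_000, 3_000)); // prints true
    }
}
```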

      Attachments

        1. 10566.sample.patch
          49 kB
          Nicolas Liochon
        2. 10566.v1.patch
          72 kB
          Nicolas Liochon
        3. 10566.v2.patch
          72 kB
          Nicolas Liochon
        4. 10566.v3.patch
          96 kB
          Nicolas Liochon

        Activity

          People

            Assignee: nkeywal (Nicolas Liochon)
            Reporter: nkeywal (Nicolas Liochon)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: