[HDFS-6166] revisit balancer so_timeout - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.3.0, 3.0.0-alpha1
Fix Version/s: 0.23.11, 2.4.0
Component/s: balancer & mover
Labels: None
Target Version/s: 2.4.0
Hadoop Flags: Reviewed

Description

~~HDFS-5806~~ changed the socket read timeout for the balancer's connection to the DN to 60 seconds. This works only as long as the balancer bandwidth is such that it's safe to assume the DN will easily complete the operation within this time, and that isn't a good assumption. When the assumption doesn't hold, the balancer will time out the command, BUT it will then be out of sync with the datanode: the balancer thinks the DN has room to do more work, while the DN is still working on the request and will fail any subsequent requests with "threads quota exceeded" errors. This causes expensive NN traffic via getBlocks() and also causes lots of WARNs in the balancer log.

Unfortunately the protocol is such that it's impossible to tell if the DN is busy working on replacing the block, OR is in bad shape and will never finish.

So, in the interest of a small change to deal with both situations, I propose the following two changes:

- Crank up the socket read timeout to 20 minutes.
- Delay looking at a node for a bit if we did time out in this way (the DN could still have xceiver threads working on the replace).
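
The two proposed changes can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the class name `BalancerTimeoutSketch`, its methods, and the 10-second back-off value are assumptions made for the example (the description only says "for a bit"); only the 20-minute read timeout comes from the proposal above.

```java
import java.net.Socket;
import java.net.SocketException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; not the code from HDFS-6166.patch. */
class BalancerTimeoutSketch {
    // 20-minute socket read timeout, as proposed in the description.
    static final int READ_TIMEOUT_MS = (int) TimeUnit.MINUTES.toMillis(20);

    // Hypothetical back-off after a timeout; the value is an assumption.
    static final long NODE_DELAY_MS = TimeUnit.SECONDS.toMillis(10);

    // Datanode id -> earliest time (ms) at which it may be considered again.
    private final Map<String, Long> delayedUntil = new ConcurrentHashMap<>();

    /** Apply the longer read timeout to the balancer-to-DN socket. */
    void configure(Socket sock) throws SocketException {
        sock.setSoTimeout(READ_TIMEOUT_MS);
    }

    /** Record that a request to this datanode timed out at nowMs. */
    void onReadTimeout(String datanode, long nowMs) {
        delayedUntil.put(datanode, nowMs + NODE_DELAY_MS);
    }

    /** True while the datanode should be skipped when scheduling moves. */
    boolean isDelayed(String datanode, long nowMs) {
        Long until = delayedUntil.get(datanode);
        if (until == null) {
            return false;
        }
        if (nowMs >= until) {
            delayedUntil.remove(datanode); // back-off has expired
            return false;
        }
        return true;
    }
}
```

Skipping a node for a short period after a timeout gives any xceiver threads still working on the replace a chance to finish, so the balancer does not immediately hit the DN's thread quota again.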

Attachments

- HDFS-6166.patch, 27/Mar/14 23:11, 4 kB, Nathan Roberts
- HDFS-6166-branch23.patch, 31/Mar/14 14:29, 3 kB, Nathan Roberts

Issue Links

is broken by

HDFS-5806 balancer should set SoTimeout to avoid indefinite hangs

Closed

People

Assignee: Nathan Roberts

Reporter: Nathan Roberts

Votes: 0

Watchers: 8

Dates

Created: 27/Mar/14 20:07

Updated: 12/May/16 18:12

Resolved: 29/Mar/14 16:25