RHEL 6, Linux 2.6.32-279 (x86_64)
Hadoop CDH 5.1.2, HA (2) federated (2) NN configuration
large production cluster
On large clusters we are seeing various forms of HDFS reads hanging:
Queries that never return.
Major compactions that hang.
Accumulo 1.6.1 incorporates detectors that report hanging major compactions and a monitor display that reports scans by age.
Stack traces show readers blocked in sun.nio.ch.EPollArrayWrapper.epollWait and in org.apache.hadoop.ipc.Client.call(Client.java:1362).
Netstat results for the tablet server show many connections with a single byte waiting in the process's Recv-Q and no bytes waiting in the Send-Q.
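For anyone trying to reproduce the netstat observation, here is a minimal sketch of a filter for connections with one byte in Recv-Q and nothing in Send-Q. It assumes the usual Linux `netstat -tn` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The function name and the sample addresses are made up for illustration and are not taken from the affected cluster.

```python
def find_stalled_connections(netstat_output, recv_q=1, send_q=0):
    """Return (local, remote, state) for TCP rows whose queue sizes match.

    Assumes `netstat -tn` style rows:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol name; skip headers and blanks.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        try:
            rq, sq = int(fields[1]), int(fields[2])
        except ValueError:
            continue
        if rq == recv_q and sq == send_q:
            matches.append((fields[3], fields[4], fields[5]))
    return matches


# Invented sample output; the addresses and ports are placeholders.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        1      0 10.0.0.5:51234          10.0.0.9:50010          ESTABLISHED
tcp        0      0 10.0.0.5:51240          10.0.0.7:8020           ESTABLISHED
tcp        0     52 10.0.0.5:22             10.0.0.2:40022          ESTABLISHED
"""

if __name__ == "__main__":
    for local, remote, state in find_stalled_connections(SAMPLE):
        print(local, remote, state)
```

On a live host the input would be the captured text of `netstat -tn` rather than the sample string.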
strace of the JVM shows the typical JVM thread noise (futex calls).
jstack shows many read requests to the NameNode (NN).
Long-running major compactions (MajCs) do complete, albeit slowly.
HDFS-7005: DFS input streams do not timeout
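To illustrate the general failure mode named in that issue title: a blocking read on a socket with no timeout waits indefinitely if the peer never sends, while a read with a timeout gives up and returns control to the caller. The sketch below uses plain Python sockets, not the HDFS client code path. The helper names `read_with_timeout` and `demo` are hypothetical, and the 0.2 second timeout is an arbitrary value chosen for the demonstration.

```python
import socket


def read_with_timeout(sock, nbytes, timeout_seconds):
    """Return received bytes, or None if nothing arrives in time."""
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Without settimeout(), recv() would block here indefinitely.
        return None


def demo():
    # A connected pair of local sockets stands in for client and server.
    reader, peer = socket.socketpair()
    try:
        # The peer stays silent, so the bounded read gives up.
        first = read_with_timeout(reader, 1, 0.2)
        # Once the peer writes, the same call returns the data.
        peer.sendall(b"x")
        second = read_with_timeout(reader, 1, 0.2)
        return first, second
    finally:
        reader.close()
        peer.close()


if __name__ == "__main__":
    print(demo())
```

A reader that lacks such a bound would show up parked in a wait call, which would be consistent with the epollWait frames in the stack traces above, though this sketch does not establish that as the cause.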