Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 2.4.0
- Fix Version: None
- Component: None
Description
About once a week, the Job History Server becomes unresponsive on one of our 2000-node Hadoop clusters. Looking at the thread dump, I see that multiple threads are blocked on locks held by a couple of threads, which in turn are stuck indefinitely in epollWait while talking to HDFS to fetch a history file.
When the number of blocked threads reaches the thread pool size, the JHS stops responding to new client requests.
Thread dump attached.
Has anyone seen this before?
Here is the thread stuck in epollWait:
"IPC Server handler 4 on 10020" daemon prio=10 tid=0x00007f7eb10f5000 nid=0x144d runnable [0x00007f7e9108d000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x00000006c89d3240> (a sun.nio.ch.Util$2)
        - locked <0x00000006c89d3228> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000006bb32f8b8> (a sun.nio.ch.EPollSelectorImpl)
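For anyone unfamiliar with this failure mode, it can be reproduced in miniature: one thread holds a monitor while blocked on a slow I/O-like operation, and every other thread that needs the monitor parks in BLOCKED state, exactly as the dump shows. The sketch below is hypothetical demo code (the class and lock names are made up, not from the JHS source); a sleep stands in for the epollWait that never returns.

```java
import java.util.concurrent.CountDownLatch;

public class LockPileUpDemo {
    // Stands in for the lock that JHS handler threads contend on.
    private static final Object historyFileLock = new Object();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);

        // Simulates the thread fetching a history file from HDFS:
        // it takes the lock, then hangs (sleep stands in for an
        // epollWait on a connection with no effective timeout).
        Thread reader = new Thread(() -> {
            synchronized (historyFileLock) {
                lockHeld.countDown();
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) { }
            }
        }, "hdfs-reader");
        reader.setDaemon(true);
        reader.start();
        lockHeld.await();

        // Simulates IPC handler threads that need the same lock.
        Thread[] handlers = new Thread[4];
        for (int i = 0; i < handlers.length; i++) {
            handlers[i] = new Thread(() -> {
                synchronized (historyFileLock) { /* never reached */ }
            }, "ipc-handler-" + i);
            handlers[i].setDaemon(true);
            handlers[i].start();
        }

        // Give the handlers a moment to park, then report their state,
        // as a thread dump would.
        Thread.sleep(200);
        for (Thread h : handlers) {
            System.out.println(h.getName() + ": " + h.getState());
        }
    }
}
```

Once the number of such BLOCKED threads equals the handler pool size, no thread is left to accept new requests, which matches the unresponsiveness we observe.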