[HDFS-15069] DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exceptions. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.1.3
Fix Version/s: None
Component/s: namenode
Labels: None

Description

More than once, while decommissioning a large number of DataNodes (DNs), we have observed that the DecommissionMonitor-0 thread stops being scheduled and blocks for a long time, with no exception logs or notifications at all.

For example, we recently decommissioned 65 DNs at the same time, each holding about 10 TB, and the DecommissionMonitor-0 thread blocked for about 15 days.

The stack of DecommissionMonitor-0 looks like this:

[Screenshot: stack on 2019.12.17 16:12 (attachment stack_on_16_12.png)]
[Screenshot: stack on 2019.12.17 16:42 (attachment stack_on_16_42.png)]

The two stacks show that the thread was not scheduled at all during that half hour: its Waited count did not change.

We think the cause of the problem is:

The DecommissionMonitor task submitted by the NameNode encounters an unchecked exception while running, after which the task is never executed again.
The NameNode does not keep track of this task's ScheduledFuture and never calls ScheduledFuture.get(), so the unchecked exception thrown by the task stays stored in the future, where nobody ever sees it.
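
This behavior can be reproduced outside HDFS. The following is a minimal sketch, not NameNode code: the class SuppressedTaskDemo, its Result holder, the 20 ms period, and the simulated failure on the third run are all hypothetical. It relies only on documented java.util.concurrent behavior: once a run of a task scheduled with scheduleAtFixedRate throws, later executions are suppressed, and the exception is visible only through ScheduledFuture.get().

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class SuppressedTaskDemo {

  /** Hypothetical holder for what the demo observed. */
  static final class Result {
    final int runs;        // how many times the periodic task ran
    final Throwable cause; // exception recovered from the future, or null

    Result(int runs, Throwable cause) {
      this.runs = runs;
      this.cause = cause;
    }
  }

  static Result runDemo() {
    ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
    AtomicInteger runs = new AtomicInteger();
    try {
      ScheduledFuture<?> future = executor.scheduleAtFixedRate(() -> {
        // Simulate an unchecked exception on the third run.
        if (runs.incrementAndGet() == 3) {
          throw new IllegalStateException("simulated failure in monitor task");
        }
      }, 0, 20, TimeUnit.MILLISECONDS);

      Throwable cause = null;
      try {
        // Only get() reveals the failure; nothing is logged by the executor.
        future.get(5, TimeUnit.SECONDS);
      } catch (ExecutionException e) {
        cause = e.getCause();
      } catch (TimeoutException e) {
        // Not expected here: the task fails on its third run.
      }

      // Give the executor time to (not) run the task again.
      Thread.sleep(200);
      return new Result(runs.get(), cause);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("demo interrupted", e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    Result r = runDemo();
    System.out.println("runs=" + r.runs + ", cause=" + r.cause);
  }
}
```

In this sketch the run counter stops advancing after the failing run, while the pool's worker thread stays alive and idle, which matches the symptom reported above.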

After that, the subsequent phenomenon is:

The ScheduledExecutorService thread DecommissionMonitor-0 blocks forever in ThreadPoolExecutor.getTask().
The previously submitted DecommissionMonitor task is never executed again.
No logs or notifications tell us exactly what happened.

Possible solutions:

1. Do not use a thread pool to execute the decommission monitor task. Instead, introduce a separate thread for it, as is done for HeartbeatManager, ReplicationMonitor, LeaseManager, BlockReportThread, and so on.

2. Catch all exceptions in the decommission monitor task's run() method, so that it never throws.
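
The second option can be sketched as a wrapper around the periodic task. This is an illustration under stated assumptions, not the actual HDFS change: GuardedTaskDemo, the guarded() helper, the failure counter, and the System.err logging are all hypothetical stand-ins for whatever the monitor and its logger would really use.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class GuardedTaskDemo {

  /** Hypothetical helper: wraps a task so that no Exception escapes run(). */
  static Runnable guarded(Runnable task, AtomicInteger failures) {
    return () -> {
      try {
        task.run();
      } catch (Exception e) {
        // Record and report the failure instead of letting it cancel the schedule.
        failures.incrementAndGet();
        System.err.println("Monitor task failed, will run again on next tick: " + e);
      }
    };
  }

  /** Returns {runs observed, failures recorded}. */
  static int[] runDemo() {
    ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
    AtomicInteger runs = new AtomicInteger();
    AtomicInteger failures = new AtomicInteger();
    CountDownLatch tenRuns = new CountDownLatch(10);
    try {
      executor.scheduleAtFixedRate(guarded(() -> {
        int n = runs.incrementAndGet();
        tenRuns.countDown();
        // Simulate an unchecked exception on the third run only.
        if (n == 3) {
          throw new IllegalStateException("simulated failure in monitor task");
        }
      }, failures), 0, 10, TimeUnit.MILLISECONDS);

      // With the guard in place, the task keeps running past the failure.
      if (!tenRuns.await(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("task stopped before reaching 10 runs");
      }
      return new int[] {runs.get(), failures.get()};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("demo interrupted", e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    int[] r = runDemo();
    System.out.println("runs=" + r[0] + ", failures=" + r[1]);
  }
}
```

The sketch catches Exception rather than Throwable, so an Error such as OutOfMemoryError would still end the schedule. Whether to catch those as well is a judgment call the sketch does not settle.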

I prefer the second option.

Attachments

stack_on_16_42.png | 18/Dec/19 08:30 | 19 kB | Xudong Cao
stack_on_16_12.png | 18/Dec/19 08:30 | 17 kB | Xudong Cao

Issue Links

duplicates: HDFS-12703 Exceptions are fatal to decommissioning monitor (Resolved)

People

Assignee: Xudong Cao

Reporter: Xudong Cao

Votes: 0

Watchers: 6

Dates

Created: 18/Dec/19 08:32

Updated: 19/Dec/19 01:37

Resolved: 19/Dec/19 01:36
