[HDFS-12703] Exceptions are fatal to decommissioning monitor - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.7.0
Fix Version/s: 2.10.0, 3.0.4, 3.3.0, 3.2.1, 3.1.3
Component/s: namenode
Labels: None
Target Version/s: 2.8.6
Hadoop Flags: Reviewed

Description

The DecommissionManager.Monitor runs as a scheduled executor task. If the task throws an exception, all decommissioning ceases until the NN is restarted. Per the javadoc for ScheduledExecutorService#scheduleAtFixedRate: "If any execution of the task encounters an exception, subsequent executions are suppressed." The monitor thread stays alive but is blocked waiting for an executor task that will never come. The code currently discards the future, so the actual exception that aborted the task is lost.
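The suppression behavior described above can be reproduced outside Hadoop. The following is a minimal sketch, not the code from any of the attached patches: the class name `SuppressedTaskDemo`, the `failingTask` stand-in for the monitor body, and the `guarded` wrapper are all hypothetical names chosen for illustration. It shows that an unguarded periodic task runs exactly once when it throws, that the exception is only recoverable through the returned future, and that catching exceptions inside the task keeps the schedule alive.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SuppressedTaskDemo {

  /** Hypothetical stand-in for the monitor body: counts the run, then fails. */
  static Runnable failingTask(AtomicInteger runs, CountDownLatch latch) {
    return () -> {
      runs.incrementAndGet();
      latch.countDown();
      throw new IllegalStateException("simulated monitor failure");
    };
  }

  /** Wraps a task so that no exception escapes to the executor. */
  static Runnable guarded(Runnable task) {
    return () -> {
      try {
        task.run();
      } catch (Exception e) {
        // A real monitor would log this; the next scheduled run still happens.
        System.err.println("tick failed: " + e);
      }
    };
  }

  /** Unguarded schedule: returns the run count and records the failure cause. */
  static int runUnguarded(StringBuilder causeOut) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    try {
      ScheduledFuture<?> future = executor.scheduleAtFixedRate(
          failingTask(runs, new CountDownLatch(1)), 0, 10, TimeUnit.MILLISECONDS);
      try {
        future.get(); // blocks until the periodic task aborts
      } catch (ExecutionException e) {
        // The future is the only place the original exception survives.
        causeOut.append(e.getCause().getMessage());
      }
      Thread.sleep(100); // several periods elapse, but nothing runs again
      return runs.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  /** Guarded schedule: waits until the task has run wantedRuns times. */
  static int runGuarded(int wantedRuns) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    CountDownLatch latch = new CountDownLatch(wantedRuns);
    try {
      executor.scheduleAtFixedRate(
          guarded(failingTask(runs, latch)), 0, 10, TimeUnit.MILLISECONDS);
      latch.await(5, TimeUnit.SECONDS);
      return runs.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    StringBuilder cause = new StringBuilder();
    System.out.println("unguarded runs: " + runUnguarded(cause));
    System.out.println("cause from future: " + cause);
    System.out.println("guarded runs: " + runGuarded(3));
  }
}
```

In this sketch the unguarded task never runs a second time even though the executor thread is still alive, which matches the symptom reported here. Whether a guard like this belongs around the whole monitor tick or around narrower units of work is a design choice the sketch does not settle.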

Failover is insufficient since the task is also likely dead on the standby. Replication queue initialization after the transition to active will fix the under-replication of blocks on nodes that are currently decommissioning, but nodes marked for decommissioning afterwards will never decommission. The standby must be bounced prior to failover, and hopefully the error condition does not recur.

Attachments

- HDFS-12703.013.patch, 10/Jul/19 07:29, 12 kB, Xiaoqiao He
- HDFS-12703.012.patch, 09/Jul/19 17:07, 12 kB, Xiaoqiao He
- HDFS-12703.011.patch, 09/Jul/19 14:03, 12 kB, Xiaoqiao He
- HDFS-12703.010.patch, 09/Jul/19 03:08, 12 kB, Xiaoqiao He
- HDFS-12703.009.patch, 09/Jul/19 00:26, 12 kB, Xiaoqiao He
- HDFS-12703.008.patch, 08/Jul/19 17:31, 12 kB, Xiaoqiao He
- HDFS-12703.007.patch, 08/Jul/19 13:29, 12 kB, Xiaoqiao He
- HDFS-12703.006.patch, 08/Jul/19 04:48, 12 kB, Xiaoqiao He
- HDFS-12703.005.patch, 07/Jul/19 17:46, 10 kB, Xiaoqiao He
- HDFS-12703.004.patch, 27/Jun/19 06:15, 8 kB, Xiaoqiao He
- HDFS-12703.003.patch, 26/Jun/19 14:08, 8 kB, Xiaoqiao He
- HDFS-12703.002.patch, 26/Jun/19 10:16, 8 kB, Xiaoqiao He
- HDFS-12703.001.patch, 29/May/19 00:42, 9 kB, Xue Liu

Issue Links

is broken by: HDFS-7411 Refactor and improve decommissioning logic into DecommissionManager (Closed)
is duplicated by: HDFS-15069 DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exceptions. (Resolved)
is related to: HDFS-14672 Backport HDFS-12703 to branch-2 (Resolved)
relates to: HDFS-12704 FBR may corrupt block state (Open)

People

Assignee: Xiaoqiao He
Reporter: Daryn Sharp
Votes: 0
Watchers: 17

Dates

Created: 24/Oct/17 16:18
Updated: 19/Dec/19 01:35
Resolved: 10/Jul/19 18:09