Description
DecommissionManager.Monitor runs as a scheduled executor task. If the task throws an exception, all decommissioning ceases until the NameNode (NN) is restarted. Per the javadoc for ScheduledExecutorService#scheduleAtFixedRate: "If any execution of the task encounters an exception, subsequent executions are suppressed." The monitor thread stays alive but blocks waiting for an executor task that will never arrive. The code currently discards the returned future, so the exception that aborted the task is lost.
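The executor behavior described above can be reproduced outside HDFS. The following is a minimal sketch, not the DecommissionManager code: the class `SuppressedTaskDemo`, the helpers `failingOnSecondRun`, `guarded`, `runUnguarded`, and `runGuarded`, and the 10 ms period are all hypothetical names and values chosen for illustration. It shows that an unguarded periodic task stops after its first exception, that the exception is still retrievable from the returned future if the future is kept, and that one possible mitigation (an assumption here, not necessarily the fix adopted for this issue) is to catch exceptions inside the task.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SuppressedTaskDemo {

    /** Hypothetical stand-in for the monitor body: throws on its second run. */
    static Runnable failingOnSecondRun(AtomicInteger runs) {
        return () -> {
            if (runs.incrementAndGet() == 2) {
                throw new IllegalStateException("simulated monitor failure");
            }
        };
    }

    /** Hypothetical guard: counts the failure and lets the schedule continue. */
    static Runnable guarded(Runnable task, AtomicInteger failures) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                failures.incrementAndGet(); // real code would log the exception
            }
        };
    }

    /** Schedules the task unguarded and returns the cause held by the future. */
    static Throwable runUnguarded(AtomicInteger runs) throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> future = executor.scheduleAtFixedRate(
                failingOnSecondRun(runs), 0, 10, TimeUnit.MILLISECONDS);
            try {
                // A periodic future only completes by cancellation or exception.
                future.get(5, TimeUnit.SECONDS);
                return null;
            } catch (ExecutionException e) {
                Thread.sleep(100); // leave time for further runs, if any occurred
                return e.getCause();
            }
        } finally {
            executor.shutdownNow();
        }
    }

    /** Schedules the guarded task and returns how many runs were observed. */
    static int runGuarded(AtomicInteger failures) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            executor.scheduleAtFixedRate(
                guarded(failingOnSecondRun(runs), failures), 0, 10, TimeUnit.MILLISECONDS);
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (runs.get() < 5 && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            return runs.get();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger unguardedRuns = new AtomicInteger();
        Throwable cause = runUnguarded(unguardedRuns);
        System.out.println("unguarded runs=" + unguardedRuns.get() + " cause=" + cause);

        AtomicInteger failures = new AtomicInteger();
        int guardedRuns = runGuarded(failures);
        System.out.println("guarded runs=" + guardedRuns + " failures=" + failures.get());
    }
}
```

In this sketch the unguarded run count stops at the failing execution, while the guarded task keeps running past it. Keeping the `ScheduledFuture` and calling `get()` on it is what makes the original exception visible; discarding the future, as the description notes, loses it.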
Failover is not a sufficient workaround, since the task is likely dead on the standby as well. Replication queue initialization after the transition to active will fix the under-replication of blocks on nodes that are currently decommissioning, but nodes that begin decommissioning afterward will never finish. The standby must be bounced prior to failover, and hopefully the error condition does not recur.
Attachments
Issue Links
- is broken by: HDFS-7411 Refactor and improve decommissioning logic into DecommissionManager (Closed)
- is duplicated by: HDFS-15069 DecommissionMonitor-0 thread will block forever while its timer task scheduled encountered any unchecked exceptions. (Resolved)
- is related to: HDFS-14672 Backport HDFS-12703 to branch-2 (Resolved)
- relates to: HDFS-12704 FBR may corrupt block state (Open)