Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
Description
There are a lot of following messages in DN log file, which makes me think there is something wrong with the heartbeat channel between the DN and SCM.
2021-09-28 10:59:36,871 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,345 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,379 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,584 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,598 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,631 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,650 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,652 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,668 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,741 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,918 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,925 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,930 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,954 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,955 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:37,995 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:38,005 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:38,097 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:38,114 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
2021-09-28 10:59:38,174 [Datanode State Machine Thread - 0] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
After add some extra debug info, I found the thing is not as the LOG shows, the heartbeat channel is very healthy and this "No available thread in pool for past 30 seconds" is quite misleading. The corrent part of the LOG is "No available thread in pool", the wrong part is "for the past 30 seconds".
2021-09-29 11:28:01,257 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: heartbeat interval : 30000, next HB 6719157366
2021-09-29 11:28:01,257 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.StateContext: Starting to execute the task 24
2021-09-29 11:28:01,257 [EndpointStateMachine task thread for /9.186.21.245:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask: Sending heartbeat message
2021-09-29 11:28:01,261 [EndpointStateMachine task thread for /9.186.21.245:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask: Received heartbeat response
2021-09-29 11:28:01,261 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.StateContext: Finished executing the task 24
2021-09-29 11:28:01,261 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: Sleep 29996 ms
2021-09-29 11:28:05,622 [ChunkWriter-125-0] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [ChunkWriter-71-0] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: heartbeat interval : 30000, next HB 6719161731
2021-09-29 11:28:05,622 [ChunkWriter-165-0] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [ChunkWriter-53-0] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.StateContext: Starting to execute the task 25
2021-09-29 11:28:05,622 [ChunkWriter-31-0] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [ChunkWriter-41-0] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [ChunkWriter-25-0] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,622 [DatanodeStateMachineDaemonThread] ERROR org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: execute task.wait is interrupted.
2021-09-29 11:28:05,623 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: heartbeat interval : 30000, next HB 6719161732
2021-09-29 11:28:05,623 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.StateContext: Starting to execute the task 26
2021-09-29 11:28:05,623 [EndpointStateMachine task thread for /9.186.21.245:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask: Sending heartbeat message
2021-09-29 11:28:05,641 [EndpointStateMachine task thread for /9.186.21.245:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask: Received heartbeat response
2021-09-29 11:28:05,642 [EndpointStateMachine task thread for /9.186.21.245:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask: Sending heartbeat message
2021-09-29 11:28:05,652 [EndpointStateMachine task thread for /9.186.21.245:9861 - 0 ] INFO org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask: Received heartbeat response
2021-09-29 11:28:05,652 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.StateContext: Finished executing the task 26
2021-09-29 11:28:05,652 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: Sleep 29971 ms
2021-09-29 11:28:05,709 [ChunkWriter-73-0] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: triggerHeartbeat: interrupt stateMachineThread for ICR report
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: heartbeat interval : 30000, next HB 6719161818
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.StateContext: Starting to execute the task 27
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] ERROR org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: execute task.wait is interrupted.
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] INFO org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine: heartbeat interval : 30000, next HB 6719161818
2021-09-29 11:28:05,709 [DatanodeStateMachineDaemonThread] WARN org.apache.hadoop.ozone.container.common.statemachine.StateContext: No available thread in pool for past 30 seconds.
Attachments
Issue Links
- links to