[HDDS-4511] Avoiding StaleNodeHandler to take effect in TestDeleteWithSlowFollower

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.1.0
Component/s: SCM
Labels:
- pull-request-available

Description

This improvement was inspired by the fix for TestDeleteWithSlowFollower, which was broken in ~~HDDS-2823~~.

In the test case TestDeleteWithSlowFollower, the following trace appears in the log:

2020-11-24 19:32:13,551 [EventQueue-StaleNodeForStaleNodeHandler] INFO  node.StaleNodeHandler (StaleNodeHandler.java:onMessage(58)) - Datanode 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null} moved to stale state. Finalizing its pipelines [PipelineID=6f0e173c-b5e2-4dc6-99e1-854aafdc8295, PipelineID=c78bc2fb-dca1-4e09-ba71-dd824e2d4e73]

2020-11-24 19:32:13,552 [EventQueue-StaleNodeForStaleNodeHandler] INFO  pipeline.SCMPipelineManager (PipelineManagerV2Impl.java:closePipeline(389)) - Pipeline Pipeline[ Id: 6f0e173c-b5e2-4dc6-99e1-854aafdc8295, Nodes: 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}46a77559-9d5c-4a1d-bad7-e7eb7b9c32da{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}524fea63-ad85-4a3a-bcfb-ac40dfe3d5e7{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:OPEN, leaderId:46a77559-9d5c-4a1d-bad7-e7eb7b9c32da, CreationTimestamp2020-11-24T11:30:23.805Z] moved to CLOSED state

By the design of this test case, however, the stale node handler should not take effect:

// Make the stale, dead and server failure timeout higher so that a dead
// node is not detected at SCM as well as the pipeline close action
// never gets initiated early at Datanode in the test.

This test case relies on ReplicationManager to close the OPEN container in SCM, so that SCM does not hold back the delete blocks command.

ReplicationManager can send the close container command either because the container is OPEN but under-replicated, or because it is OPEN but has a CLOSED replica.
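The two conditions described above can be sketched as a small predicate. This is only an illustration of the reasoning in this description, not the actual ReplicationManager code: the class `CloseDecisionSketch`, the method `shouldSendClose`, and the simplified `State` enum are hypothetical names introduced here.

```java
import java.util.List;

// Hypothetical, simplified model; not the real SCM classes.
final class CloseDecisionSketch {

    // Simplified container/replica lifecycle states (assumed for illustration).
    enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, DELETING, DELETED }

    /**
     * Returns true when a close container command would be sent for an OPEN
     * container, following the two conditions named in the description:
     * the container is under-replicated, or it already has a CLOSED replica.
     */
    static boolean shouldSendClose(State container,
                                   List<State> replicaStates,
                                   int replicationFactor) {
        if (container != State.OPEN) {
            return false; // only OPEN containers are considered in this sketch
        }
        boolean underReplicated = replicaStates.size() < replicationFactor;
        boolean hasClosedReplica = replicaStates.contains(State.CLOSED);
        return underReplicated || hasClosedReplica;
    }
}
```

In this model, an OPEN container with two replicas and a replication factor of three triggers a close command, which is the path the test case depends on.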

Since the default interval of ReplicationManager (RM) is 5 minutes, the test case actually relies on the "OPEN container but under-replicated" condition to avoid triggering the stale node handler.

But that command is no longer sent, because ReplicationManager#isContainerUnderReplicated does not consider OPEN containers; it only handles CLOSED and QUASI_CLOSED containers.

After discussing this with Sammi: by design, the check only needs to explicitly avoid replicating containers in the DELETING or DELETED state.
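The difference between the two behaviours can be sketched as follows. This is a minimal model under stated assumptions, not the code from the pull request: `UnderReplicationSketch`, both method names, the `State` enum, and the use of a plain replica count are hypothetical simplifications.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model; not the real ReplicationManager.
final class UnderReplicationSketch {

    // Simplified container lifecycle states (assumed for illustration).
    enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, DELETING, DELETED }

    // Behaviour described as the problem: only CLOSED and QUASI_CLOSED
    // containers are ever treated as under-replicated.
    private static final Set<State> CHECKED_BEFORE =
        EnumSet.of(State.CLOSED, State.QUASI_CLOSED);

    // Behaviour described as the intended design: every state is checked
    // except DELETING and DELETED.
    private static final Set<State> SKIPPED_AFTER =
        EnumSet.of(State.DELETING, State.DELETED);

    static boolean isUnderReplicatedBefore(State container, int replicas, int factor) {
        return CHECKED_BEFORE.contains(container) && replicas < factor;
    }

    static boolean isUnderReplicatedAfter(State container, int replicas, int factor) {
        return !SKIPPED_AFTER.contains(container) && replicas < factor;
    }
}
```

With two replicas and a factor of three, an OPEN container is not reported as under-replicated by the first method but is by the second, so the close container command is issued again in this model.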

Issue Links

links to

GitHub Pull Request #1625

People

Assignee: Glen Geng

Reporter: Glen Geng

Votes: 0

Watchers: 1

Dates

Created: 25/Nov/20 12:33

Updated: 28/Feb/23 12:46

Resolved: 28/Feb/23 12:46