HDDS-4511: Prevent StaleNodeHandler from taking effect in TestDeleteWithSlowFollower

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.1.0
    • Component/s: SCM

    Description

      This improvement was prompted by fixing TestDeleteWithSlowFollower, which was broken in HDDS-2823.

       

      When the test case TestDeleteWithSlowFollower runs, the following trace appears in the log:

      2020-11-24 19:32:13,551 [EventQueue-StaleNodeForStaleNodeHandler] INFO  node.StaleNodeHandler (StaleNodeHandler.java:onMessage(58)) - Datanode 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null} moved to stale state. Finalizing its pipelines [PipelineID=6f0e173c-b5e2-4dc6-99e1-854aafdc8295, PipelineID=c78bc2fb-dca1-4e09-ba71-dd824e2d4e73]
      2020-11-24 19:32:13,552 [EventQueue-StaleNodeForStaleNodeHandler] INFO  pipeline.SCMPipelineManager (PipelineManagerV2Impl.java:closePipeline(389)) - Pipeline Pipeline[ Id: 6f0e173c-b5e2-4dc6-99e1-854aafdc8295, Nodes: 132e6d1b-e472-449e-929e-5f42b87114c6{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}46a77559-9d5c-4a1d-bad7-e7eb7b9c32da{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}524fea63-ad85-4a3a-bcfb-ac40dfe3d5e7{ip: 10.73.23.64, host: 10.73.23.64, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:OPEN, leaderId:46a77559-9d5c-4a1d-bad7-e7eb7b9c32da, CreationTimestamp2020-11-24T11:30:23.805Z] moved to CLOSED state
      
      

       

      However, by the design of this test case, the stale node handler should not take effect. The test's own setup comment says so:

      // Make the stale, dead and server failure timeout higher so that a dead
      // node is not detected at SCM as well as the pipeline close action
      // never gets initiated early at Datanode in the test.
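
To illustrate why raising the timeouts is expected to keep the handler quiet, here is a minimal sketch of a heartbeat-based staleness check. It is not SCM's actual node state logic: the class `StaleCheckSketch`, the method `isStale`, and all durations below are assumptions chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class StaleCheckSketch {
  // Hypothetical rule: a node counts as stale once the time since its
  // last heartbeat exceeds the configured stale interval.
  public static boolean isStale(Instant lastHeartbeat, Instant now,
                                Duration staleInterval) {
    return Duration.between(lastHeartbeat, now).compareTo(staleInterval) > 0;
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    // Illustrative value: the slow follower has been silent for 110 seconds.
    Instant lastHeartbeat = now.minus(Duration.ofSeconds(110));

    // With a short interval the node is reported stale.
    System.out.println(isStale(lastHeartbeat, now, Duration.ofSeconds(90)));
    // With a much higher interval the same silence is tolerated.
    System.out.println(isStale(lastHeartbeat, now, Duration.ofSeconds(1000)));
  }
}
```

Under this model, a node that still goes stale during the test means the effective interval was shorter than the test's running time, which is the symptom shown in the trace.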

       

      This test case relies on ReplicationManager to close the OPEN container in SCM, so that SCM won't hold the delete blocks command. 

      ReplicationManager (RM) can send the close container command in two cases: the container is OPEN but under-replicated, or the container is OPEN but has a CLOSED replica.

      Since the default interval of RM is 5 minutes, the test case actually relies on the "OPEN but under-replicated" case to avoid triggering the stale node handler.
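
The two trigger conditions the test expects can be written as a small predicate. This is a simplified model of the behaviour described above, not the actual ReplicationManager code: the enum `State`, the method `shouldSendClose`, and the replica-count parameter are hypothetical.

```java
import java.util.List;

public class CloseTriggerSketch {
  // Modelled container/replica states; names follow the states mentioned in this issue.
  public enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, DELETING, DELETED }

  // Hypothetical predicate: an OPEN container should get a close command
  // if it has fewer replicas than expected, or if any replica is already CLOSED.
  public static boolean shouldSendClose(State container, List<State> replicas,
                                        int expectedReplicas) {
    if (container != State.OPEN) {
      return false;
    }
    boolean underReplicated = replicas.size() < expectedReplicas;
    boolean hasClosedReplica = replicas.contains(State.CLOSED);
    return underReplicated || hasClosedReplica;
  }

  public static void main(String[] args) {
    // Case 1: OPEN container, only 2 of 3 replicas reported.
    System.out.println(shouldSendClose(State.OPEN,
        List.of(State.OPEN, State.OPEN), 3));
    // Case 2: OPEN container, fully replicated, one replica already CLOSED.
    System.out.println(shouldSendClose(State.OPEN,
        List.of(State.OPEN, State.OPEN, State.CLOSED), 3));
    // Neither condition holds: no close command.
    System.out.println(shouldSendClose(State.OPEN,
        List.of(State.OPEN, State.OPEN, State.OPEN), 3));
  }
}
```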

       

      However, the command is never sent, because ReplicationManager#isContainerUnderReplicated does not consider OPEN containers; it only handles CLOSED and QUASI_CLOSED containers.

       

      After discussing this with Sammi: by design, the check only needs to explicitly avoid replicating containers in the DELETING or DELETED state.
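
A minimal sketch of the difference between the two state filters discussed in this issue. It does not reproduce the real `ReplicationManager#isContainerUnderReplicated`: the enum `State` and both method names are hypothetical, and replica counting is reduced to a plain comparison so that only the state filter is modelled.

```java
public class UnderReplicationSketch {
  // Modelled container lifecycle states.
  public enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, DELETING, DELETED }

  // Behaviour described in the issue: only CLOSED and QUASI_CLOSED
  // containers are ever reported as under-replicated.
  public static boolean underReplicatedBefore(State s, int replicas, int expected) {
    if (s != State.CLOSED && s != State.QUASI_CLOSED) {
      return false;
    }
    return replicas < expected;
  }

  // Proposed behaviour: every state is considered, except DELETING and DELETED.
  public static boolean underReplicatedAfter(State s, int replicas, int expected) {
    if (s == State.DELETING || s == State.DELETED) {
      return false;
    }
    return replicas < expected;
  }

  public static void main(String[] args) {
    // An OPEN container with 2 of 3 replicas, as in the slow-follower test.
    System.out.println(underReplicatedBefore(State.OPEN, 2, 3)); // false
    System.out.println(underReplicatedAfter(State.OPEN, 2, 3));  // true
  }
}
```

With the second filter, the OPEN container in the test would be treated as under-replicated, which is the condition the test relies on.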

            People

              Assignee: Glen Geng (glengeng)
              Reporter: Glen Geng (glengeng)