  1. Apache Ozone
  2. HDDS-2972

Any container replication error can terminate SCM service

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • 0.4.1
    • None
    • SCM

    Description

      I found that any container replication error thrown in ReplicationManager can terminate the SCM service. Terminating the SCM service because of a single container replication error is very expensive behavior.

      It's not worth shutting down the SCM for this. We can handle it more gracefully: catch the exception and log a warning message together with the thrown exception.

      The shutdown info:

      2020-01-30 08:16:04,705 ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception in Replication Monitor Thread.
      java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
              at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:789)
              at org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:399)
              at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:249)
              at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:173)
              at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:515)
              at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:311)
              at java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
              at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
              at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:223)
              at java.lang.Thread.run(Thread.java:745)
      2020-01-30 08:16:04,730 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Affinity node /dc1/rack1/b9343ca0-a4bc-4436-9671-bc1de6c8bd89 is not a member of topology
      2020-01-30 08:16:04,734 INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
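
A minimal sketch of the handling proposed in the description, assuming the monitor thread visits containers one by one as the stack trace above suggests. The class `ReplicationMonitorSketch`, the method `runOnce`, and the string container IDs are hypothetical stand-ins for illustration; this is not the actual ReplicationManager code or the committed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for a replication monitor; not the real SCM class. */
public class ReplicationMonitorSketch {
  private static final Logger LOG =
      Logger.getLogger(ReplicationMonitorSketch.class.getName());

  /**
   * Runs one monitor pass. A failure while processing one container is
   * logged as a warning and the pass continues with the next container,
   * instead of propagating and terminating the whole service.
   *
   * @return the number of containers whose processing failed
   */
  public static int runOnce(List<String> containerIds,
                            Consumer<String> processor) {
    int failed = 0;
    for (String id : containerIds) {
      try {
        processor.accept(id);
      } catch (Exception e) {
        failed++;
        // Log the exception with its stack trace, then keep going.
        LOG.log(Level.WARNING,
            "Replication failed for container " + id + ", skipping it", e);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    List<String> processed = new ArrayList<>();
    int failed = runOnce(Arrays.asList("c1", "c2", "c3"), id -> {
      if (id.equals("c2")) {
        // Simulates the placement failure seen in the log above.
        throw new IllegalArgumentException(
            "Affinity node is not a member of topology");
      }
      processed.add(id);
    });
    // prints "processed=[c1, c3] failed=1" (the warning goes to stderr)
    System.out.println("processed=" + processed + " failed=" + failed);
  }
}
```

Catching the exception per container, rather than once around the whole pass, means a bad container does not prevent the remaining containers from being checked in the same pass.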
      

            People

              Assignee: Yiqun Lin (linyiqun)
              Reporter: Yiqun Lin (linyiqun)
              Votes: 0
              Watchers: 1

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m