Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-5249

Race Condition between Full and Incremental Container Reports

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: SCM
    • Target Version/s:

      Description

      During testing we came across an issue with ICR and FCR handing.

      The following log shows the issue:

      2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica of container #1 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
      
      
      2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.IncrementalContainerReportHandler: Processing replica of container #1001 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
      
      
      2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica of container #2 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
      2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica of container #3 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
      2021-05-18 13:14:15,394 DEBUG org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Processing replica of container #4 from datanode 945aa180-5cff-4298-a8ad-8197542e4562{ip: 172.27.108.136, host: quasar-nqdywv-7.quasar-nqdywv.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
      ...
      

      In the above log, SCM is processing both an ICR and FCR for the same Datanode at the same time. The FCR does not container container #1001.

      The FCR starts first, and it takes a snapshot of the containers on the node via NodeManager.

      Then it starts processing the containers one by one.

      The ICR then starts, and it added #1001 to the ContainerManager and to the NodeManager.

      When the FCR completes, it replaces the list of containers in NodeManager with those in the FCR.

      At this point, container #1001 is in the ContainerManager, but it is not listed against the node in NodeManager.

      This would get fixed by the next FCR, but then the node goes dead. The dead node handler runs and uses the list of containers in NodeManager to remove all containers for the node. As #1001 is not listed, it is not removed by the DeadNodeManager. This means the container will never been seen as under replicated, as 3 copies will exist forever in the ContainerManager.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                sodonnell Stephen O'Donnell
                Reporter:
                nilotpalnandi Nilotpal Nandi
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: