Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-4336

ContainerInfo does not persist BCSID leading to failed replicas reports

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.1.0
    • 1.1.0
    • SCM

    Description

      If you create a container, and then close it, the BCSID is synced on the datanodes and then the value is updated in SCM via setting the "sequenceID" field on the containerInfo object for the container.

      If you later restart just SCM, the sequenceID becomes zero, and then container reports for the replica fail with a stack trace like:

      Exception in thread "EventQueue-ContainerReportForContainerReportHandler" java.lang.AssertionError
      	at org.apache.hadoop.hdds.scm.container.ContainerInfo.updateSequenceId(ContainerInfo.java:176)
      	at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.updateContainerStats(AbstractContainerReportHandler.java:108)
      	at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:83)
      	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:162)
      	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:130)
      	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
      	at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      The assertion here is failing, as it does not allow for the sequenceID to be changed on a CLOSED container:

        public void updateSequenceId(long sequenceID) {
          assert (isOpen() || state == HddsProtos.LifeCycleState.QUASI_CLOSED);
          sequenceId = max(sequenceID, sequenceId);
        }
      

      The issue seems to be caused by the serialisation and deserialisation of the containerInfo object to protobuf, as sequenceId never persisted or restored.

      However, I am also confused about how this ever worked, as this is a pretty significant problem.

      Attachments

        Issue Links

          Activity

            People

              sodonnell Stephen O'Donnell
              sodonnell Stephen O'Donnell
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: