  1. Apache Ozone
  2. HDDS-4580

Datanode can be stuck in leader not ready state after restart

Details

    Description

      On restart, transactions are reapplied for an existing Ratis pipeline. While processing the resulting future, ContainerStateMachine#applyTransaction can throw a NullPointerException, which causes the future to complete exceptionally.

            future.thenApply(r -> {
              if (trx.getServerRole() == RaftPeerRole.LEADER) {
                // On restart the context is null, so unboxing it to long throws NullPointerException.
                long startTime = (long) trx.getStateMachineContext();
                metrics.incPipelineLatency(cmdType,
                    Time.monotonicNowNanos() - startTime);
              }
      

      In the snippet above, trx.getStateMachineContext() returns null during restart, so the cast to long throws a NullPointerException. This fails the future itself before applyTransactionCompletionMap is updated. As a result, lastAppliedIndex is never updated for the server, and the server remains stuck in the leader-not-ready state.
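
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the Ozone code and not necessarily the fix that was applied: the class name, the `completed` map (a stand-in for applyTransactionCompletionMap), and the `recordLatency` helper are all hypothetical. It shows that unboxing a null context inside `thenApply` fails the returned future before the completion map is touched, and that a null check on the context lets the callback proceed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-alone model of the callback; none of these names come from Ozone.
class ApplyTransactionSketch {
    // Stand-in for applyTransactionCompletionMap: log index -> applied marker.
    static final ConcurrentMap<Long, Long> completed = new ConcurrentHashMap<>();

    // Hypothetical metrics hook; it only discards the value.
    static void recordLatency(long nanos) {
        // no-op in this sketch
    }

    // Mirrors the problematic pattern: the context is unboxed without a null check.
    static CompletableFuture<String> unguarded(long index, Object context, boolean isLeader) {
        CompletableFuture<String> future = CompletableFuture.completedFuture("reply");
        return future.thenApply(r -> {
            if (isLeader) {
                long startTime = (long) context; // throws NullPointerException when context is null
                recordLatency(System.nanoTime() - startTime);
            }
            completed.put(index, index); // never reached if the line above throws
            return r;
        });
    }

    // One possible guard (an assumption): skip the latency metric when no context exists.
    static CompletableFuture<String> guarded(long index, Object context, boolean isLeader) {
        CompletableFuture<String> future = CompletableFuture.completedFuture("reply");
        return future.thenApply(r -> {
            if (isLeader && context != null) {
                long startTime = (long) context;
                recordLatency(System.nanoTime() - startTime);
            }
            completed.put(index, index);
            return r;
        });
    }

    public static void main(String[] args) {
        CompletableFuture<String> bad = unguarded(1L, null, true);
        System.out.println("unguarded failed: " + bad.isCompletedExceptionally());
        CompletableFuture<String> ok = guarded(2L, null, true);
        System.out.println("guarded failed: " + ok.isCompletedExceptionally());
    }
}
```

In this model the guarded variant simply drops the latency sample for replayed transactions; whether that is the right trade-off for the real metrics is a design decision for the actual patch.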

            People

              ljain Lokesh Jain