[HDDS-4580] Datanode can be stuck in leader not ready state after restart - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: Ozone Datanode
Labels:
- pull-request-available

Target Version/s: 1.1.0

Description

On restart, transactions are reapplied for an existing Ratis pipeline. While processing the resulting future, ContainerStateMachine#applyTransaction can throw a NullPointerException, which causes the future to complete exceptionally.

      future.thenApply(r -> {
        if (trx.getServerRole() == RaftPeerRole.LEADER) {
          long startTime = (long) trx.getStateMachineContext();
          metrics.incPipelineLatency(cmdType,
              Time.monotonicNowNanos() - startTime);
        }

In the snippet above, trx.getStateMachineContext() returns null during restart, so the cast to long throws a NullPointerException. The future fails without updating applyTransactionCompletionMap. As a result, lastAppliedIndex is never updated for the server, and the server remains stuck in the leader-not-ready state.
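The failure mode can be reproduced outside Ozone with plain CompletableFuture. The sketch below is illustrative only: the class ApplyTransactionSketch, its methods, and the lastLatencyNanos field are hypothetical stand-ins, not Ozone or Ratis APIs, and the null guard shown is just one possible way to avoid the problem; the actual change in the linked pull request may differ.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

public class ApplyTransactionSketch {

  // Hypothetical stand-in for the metrics call; holds the last latency recorded.
  static final AtomicLong lastLatencyNanos = new AtomicLong(-1);

  // Mirrors the reported pattern: the context is unboxed without a null check.
  static CompletableFuture<Long> applyUnguarded(
      Object context, long index, AtomicLong lastAppliedIndex) {
    CompletableFuture<Long> future = CompletableFuture.completedFuture(index);
    return future.thenApply(i -> {
      // Unboxing a null context throws NullPointerException here.
      long startTime = (long) context;
      lastLatencyNanos.set(System.nanoTime() - startTime);
      // Never reached when the context is null.
      lastAppliedIndex.set(i);
      return i;
    });
  }

  // Assumed guard: skip the latency metric when no context is available.
  static CompletableFuture<Long> applyGuarded(
      Object context, long index, AtomicLong lastAppliedIndex) {
    CompletableFuture<Long> future = CompletableFuture.completedFuture(index);
    return future.thenApply(i -> {
      if (context != null) {
        long startTime = (long) context;
        lastLatencyNanos.set(System.nanoTime() - startTime);
      }
      // The applied index is updated whether or not the metric was recorded.
      lastAppliedIndex.set(i);
      return i;
    });
  }

  public static void main(String[] args) {
    AtomicLong unguardedIndex = new AtomicLong(-1);
    CompletableFuture<Long> failed = applyUnguarded(null, 7L, unguardedIndex);
    System.out.println("unguarded failed: " + failed.isCompletedExceptionally());
    System.out.println("unguarded index: " + unguardedIndex.get());

    AtomicLong guardedIndex = new AtomicLong(-1);
    CompletableFuture<Long> ok = applyGuarded(null, 7L, guardedIndex);
    System.out.println("guarded failed: " + ok.isCompletedExceptionally());
    System.out.println("guarded index: " + guardedIndex.get());
  }
}
```

In the unguarded variant the exception is captured by the returned future rather than thrown to the caller, so the index update is skipped silently. This matches the behaviour described in this issue.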

Issue Links

links to

GitHub Pull Request #1690

People

Assignee: Lokesh Jain

Reporter: Lokesh Jain

Votes: 0

Watchers: 2

Dates

Created: 11/Dec/20 14:54

Updated: 14/Dec/20 21:50

Resolved: 14/Dec/20 21:50