Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-3116

Datanode sometimes fails to start with NPE when starting Ratis xceiver server

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 0.5.0
    • 0.5.0
    • Ozone Datanode
    • None

    Description

      While working on a network Topology test (HDDS-3084) which does the following:

      1. Start a cluster with 6 DNs and 2 racks.

      2. Create a volume, bucket and a single key.

      3. Stop one rack of hosts using "docker-compose down"

      4. Read the data from the single key

      5. Start the 3 down hosts

      6. Stop the other 3 hosts

      7. Attempt to read the key again.

      At step 5 I sometimes see this stack trace in one of the DNs and it fails to full come up:

      2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ozoneimpl.OzoneContainer: Attempting to start container services.
      2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ozoneimpl.OzoneContainer: Background container scanner has been disabled.
      2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ratis.XceiverServerRatis: Starting XceiverServerRatis 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
      2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN statemachine.EndpointStateMachine: Unable to communicate to SCM server at scm:9861 for past 15000 seconds.
      java.io.IOException: java.lang.NullPointerException
      	at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
      	at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
      	at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
      	at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
      	at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
      	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
      	at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
      	at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
      	at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.lang.NullPointerException
      	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
      	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
      	at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
      	at org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
      	at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:112)
      	at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:112)
      	at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
      	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
      	... 3 more
      

      The DN does not recover from this automatically, although I confirmed that a full cluster restart fixed it (docker-compose stop; docker-compose start). I will try to confirm if a restart of the stuck DN would fix it or not too.

      Attachments

        1. full_logs.txt
          912 kB
          Stephen O'Donnell

        Issue Links

          Activity

            People

              sodonnell Stephen O'Donnell
              sodonnell Stephen O'Donnell
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m