Uploaded image for project: 'Apache Ozone'
  1. Apache Ozone
  2. HDDS-3116

Datanode sometimes fails to start with NPE when starting Ratis xceiver server

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 0.5.0
    • 0.5.0
    • Ozone Datanode
    • None

    Description

      While working on a network Topology test (HDDS-3084) which does the following:

      1. Start a cluster with 6 DNs and 2 racks.

      2. Create a volume, bucket and a single key.

      3. Stop one rack of hosts using "docker-compose down"

      4. Read the data from the single key

      5. Start the 3 down hosts

      6. Stop the other 3 hosts

      7. Attempt to read the key again.

      At step 5 I sometimes see this stack trace in one of the DNs and it fails to full come up:

      2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ozoneimpl.OzoneContainer: Attempting to start container services.
      2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ozoneimpl.OzoneContainer: Background container scanner has been disabled.
      2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO ratis.XceiverServerRatis: Starting XceiverServerRatis 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
      2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN statemachine.EndpointStateMachine: Unable to communicate to SCM server at scm:9861 for past 15000 seconds.
      java.io.IOException: java.lang.NullPointerException
      	at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
      	at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
      	at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
      	at org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
      	at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
      	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
      	at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
      	at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
      	at org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.lang.NullPointerException
      	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
      	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
      	at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
      	at org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
      	at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:112)
      	at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:112)
      	at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
      	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
      	... 3 more
      

      The DN does not recover from this automatically, although I confirmed that a full cluster restart fixed it (docker-compose stop; docker-compose start). I will try to confirm if a restart of the stuck DN would fix it or not too.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            sodonnell Stephen O'Donnell
            sodonnell Stephen O'Donnell
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 20m
                20m

                Slack

                  Issue deployment