  1. Apache Ozone
  2. HDDS-6093

Improve error handling if a container is not found during replication


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Ozone Datanode
    • Labels: None

    Description

      When a datanode receives a request to download / copy a container and the container does not exist in the ContainerMap on that datanode, the caller does not get a useful error message. For example, the caller gets a stack trace like:

      2021-12-08 12:46:50,537 ERROR org.apache.hadoop.ozone.container.replication.GrpcReplicationClient: Download of container 10009 was unsuccessful
      org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNKNOWN
      	at org.apache.ratis.thirdparty.io.grpc.Status.asRuntimeException(Status.java:533)
      	at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:453)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
      	at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      To make things worse, nothing is written to the role log on the source datanode; instead, this stack trace appears in the stderr output:

      Dec 08, 2021 12:46:50 PM org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor run
      SEVERE: Exception while executing runnable org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@62026ae8
      java.lang.NullPointerException: Container is not found 10009
      	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:897)
      	at org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:56)
      	at org.apache.hadoop.ozone.container.replication.GrpcReplicationService.download(GrpcReplicationService.java:56)
      	at org.apache.hadoop.hdds.protocol.datanode.proto.IntraDatanodeProtocolServiceGrpc$MethodHandlers.invoke(IntraDatanodeProtocolServiceGrpc.java:219)
      	at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
      	at org.apache.ratis.thirdparty.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
      	at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
      	at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:818)
      	at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
      	at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

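      The reason is that a NullPointerException is thrown in OnDemandContainerReplicationSource and is not caught by the caller (GrpcReplicationService.download, which only catches IOException). The exception therefore escapes the handler and is only logged by the gRPC SerializingExecutor, which is why it lands in stderr rather than in the role log.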
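
The failure mode can be reproduced without gRPC. The following is a minimal sketch, not the Ozone code: the class and method names are hypothetical, and `Objects.requireNonNull` stands in for Guava's `Preconditions.checkNotNull` seen in the trace above. It shows an unchecked NullPointerException passing straight through a handler that only catches IOException.

```java
import java.io.IOException;
import java.util.Objects;

// Hypothetical stand-in classes; none of these names come from Ozone.
class UncheckedEscapeSketch {

  // Stand-in for copyData: a null lookup result triggers an unchecked NPE,
  // much like Preconditions.checkNotNull does in the trace.
  static void copyData(Object container, long containerId) throws IOException {
    Objects.requireNonNull(container, "Container is not found " + containerId);
  }

  // Stand-in for the download handler: it only catches IOException,
  // so the NPE is never turned into an error response for the caller.
  static String handle(Object container, long containerId) {
    try {
      copyData(container, containerId);
      return "streamed";
    } catch (IOException e) {
      return "reported to caller: " + e.getMessage();
    }
  }

  // Stand-in for the executor that runs the handler and sees whatever escapes.
  static String runLikeExecutor(Object container, long containerId) {
    try {
      return handle(container, containerId);
    } catch (RuntimeException e) {
      return "escaped handler: " + e.getClass().getSimpleName()
          + ": " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(runLikeExecutor(new Object(), 1L));
    System.out.println(runLikeExecutor(null, 10009L));
  }
}
```

In this sketch the second call never reaches the `catch (IOException e)` branch, which mirrors why the caller in the report only sees a status of UNKNOWN.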

      The solution is to explicitly handle the null container and throw an IOException, which will be handled by the existing catch block in download so that the response status is set correctly:

        public void download(CopyContainerRequestProto request,
            StreamObserver<CopyContainerResponseProto> responseObserver) {
          long containerID = request.getContainerID();
          LOG.info("Streaming container data ({}) to other datanode", containerID);
          try {
            GrpcOutputStream outputStream =
                new GrpcOutputStream(responseObserver, containerID, BUFFER_SIZE);
            source.copyData(containerID, outputStream);
          } catch (IOException e) {
            LOG.error("Error streaming container {}", containerID, e);
            responseObserver.onError(e);
          }
        }
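
The source-side half of the fix could look roughly like the sketch below. This is an assumption-laden illustration rather than the actual patch: `ReplicationSourceSketch`, its in-memory map, and the string returned by `download` are all hypothetical stand-ins for the datanode's container map and the gRPC response. The point is only that a missing container becomes a checked IOException, which a handler shaped like the one above can catch and report.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the replication source; not the Ozone class.
class ReplicationSourceSketch {

  // Stand-in for the datanode's container map: container ID -> payload.
  private final Map<Long, byte[]> containers = new ConcurrentHashMap<>();

  void addContainer(long containerId, byte[] data) {
    containers.put(containerId, data);
  }

  // A missing container is reported as a checked IOException
  // instead of an unchecked NullPointerException.
  void copyData(long containerId, OutputStream destination) throws IOException {
    byte[] data = containers.get(containerId);
    if (data == null) {
      throw new IOException("Container " + containerId + " is not found");
    }
    destination.write(data);
  }

  // Mirrors the try/catch shape of download(): the IOException is caught
  // and turned into an error result instead of escaping the handler.
  String download(long containerId) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      copyData(containerId, out);
      return "OK: " + out.size() + " bytes";
    } catch (IOException e) {
      return "ERROR: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    ReplicationSourceSketch source = new ReplicationSourceSketch();
    source.addContainer(1L, "data".getBytes(StandardCharsets.UTF_8));
    System.out.println(source.download(1L));      // prints "OK: 4 bytes"
    System.out.println(source.download(10009L));  // prints "ERROR: Container 10009 is not found"
  }
}
```

In the real service the error branch would log to the role log and call responseObserver.onError(e), as in the download method shown above.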
      

            People

              sodonnell Stephen O'Donnell