Hadoop Distributed Data Store / HDDS-3559

Datanode doesn't handle Java heap OutOfMemory exception

      Description

      2020-05-05 15:47:41,568 [Datanode State Machine Thread - 167] WARN org.apache.hadoop.ozone.container.common.statemachine.EndpointStateMachine: Unable to communicate to SCM server at host-10-51-87-181:9861 for past 0 seconds.
      java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space
              at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
              at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:118)
              at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.sendHeartbeat(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:148)
              at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:145)
              at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:76)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.getReturnMessage(ProtobufRpcEngine.java:293)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:270)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
              at com.sun.proxy.$Proxy38.submitRequest(Unknown Source)
              at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:116)
       
      On a cluster, one datanode stopped reporting to SCM, and the failure went unnoticed. The datanode process is still running. The log shows a Java heap OutOfMemoryError raised while a protobuf RPC message was being serialized. Rather than handling the error, the datanode silently stops sending reports to SCM, and the process becomes stale.
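
In the trace above, the OutOfMemoryError reaches the heartbeat task wrapped inside a ServiceException and then an IOException, so it is logged as an ordinary communication warning. One possible way to tell the two cases apart is to walk the cause chain of the caught exception. The following is a minimal sketch using only JDK classes; it is not the Ozone fix, and the class name `OomCauseCheck` and method `hasOutOfMemoryCause` are hypothetical names introduced for illustration.

```java
import java.io.IOException;

// Hypothetical helper, not part of Ozone or Hadoop.
public final class OomCauseCheck {

    private OomCauseCheck() {
    }

    // Returns true if t, or any exception in its cause chain, is an
    // OutOfMemoryError. The depth limit guards against cyclic cause chains.
    static boolean hasOutOfMemoryCause(Throwable t) {
        for (int depth = 0; t != null && depth < 32; depth++) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        // Mimics the nesting seen in the log: IOException -> wrapper -> OutOfMemoryError.
        // RuntimeException stands in for com.google.protobuf.ServiceException here.
        Throwable wrapped = new IOException(
            new RuntimeException(new OutOfMemoryError("Java heap space")));
        Throwable plain = new IOException("connection refused");

        System.out.println("wrapped has OOM cause: " + hasOutOfMemoryCause(wrapped));
        System.out.println("plain has OOM cause: " + hasOutOfMemoryCause(plain));
    }
}
```

A caller that catches the IOException could use such a check to decide between retrying the heartbeat and escalating, for example by terminating the process so that a supervisor restarts it. What the right reaction is for the datanode is left open in this report. Independently of code changes, the HotSpot option `-XX:+ExitOnOutOfMemoryError` makes the JVM exit when it first throws an OutOfMemoryError, which may be an acceptable stopgap where the option is available.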

            People

            • Assignee:
              Unassigned
              Reporter:
              licheng Li Cheng