  1. Hadoop YARN
  2. YARN-8242

YARN NM: OOM error while reading back the state store on recovery

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.6.0, 2.9.0, 2.6.5, 2.8.3, 3.1.0, 2.7.6, 3.0.2
    • Fix Version/s: 3.2.0, 3.1.2
    • Component/s: yarn
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      On startup, the NM reads its state store and builds a list of the applications stored there to process. If the state store holds a large number of applications, each with a lot of "state" attached, the NM can run out of memory (OOM) before it ever starts processing the recovery.
      Because recovery never starts, the NM cannot get past this point on its own. Getting the NM started requires increasing the heap size.

      The stack trace follows:

      at java.lang.OutOfMemoryError.<init> (OutOfMemoryError.java:48)
      at com.google.protobuf.ByteString.copyFrom (ByteString.java:192)
      at com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324)
      at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto.<init> (YarnProtos.java:47069)
      at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto.<init> (YarnProtos.java:47014)
      at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom (YarnProtos.java:47102)
      at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom (YarnProtos.java:47097)
      at com.google.protobuf.CodedInputStream.readMessage (CodedInputStream.java:309)
      at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.<init> (YarnProtos.java:41016)
      at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.<init> (YarnProtos.java:40942)
      at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom (YarnProtos.java:41080)
      at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom (YarnProtos.java:41075)
      at com.google.protobuf.CodedInputStream.readMessage (CodedInputStream.java:309)
      at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.<init> (YarnServiceProtos.java:24517)
      at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.<init> (YarnServiceProtos.java:24464)
      at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom (YarnServiceProtos.java:24568)
      at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom (YarnServiceProtos.java:24563)
      at com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141)
      at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176)
      at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188)
      at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193)
      at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49)
      at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom (YarnServiceProtos.java:24739)
      at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState (NMLeveldbStateStoreService.java:217)
      at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState (NMLeveldbStateStoreService.java:170)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover (ContainerManagerImpl.java:253)
      at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit (ContainerManagerImpl.java:237)
      at org.apache.hadoop.service.AbstractService.init (AbstractService.java:163)
      at org.apache.hadoop.service.CompositeService.serviceInit (CompositeService.java:107)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit (NodeManager.java:255)
      at org.apache.hadoop.service.AbstractService.init (AbstractService.java:163)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager (NodeManager.java:474)
      at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main (NodeManager.java:521)
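
      To illustrate the failure mode described above, here is a minimal, self-contained sketch contrasting "load every record into a list, then process" with "process each record as it is read". It is not taken from any of the attached patches and does not use the real NodeManager classes: Record, openStore, recoverEagerly, recoverLazily and process are all hypothetical names, and the byte counts only model the payload that each approach keeps reachable at once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class RecoverySketch {

    /** Hypothetical stand-in for one deserialized container record. */
    static final class Record {
        final int id;
        final byte[] payload; // models launch context, environment map, etc.

        Record(int id, int payloadBytes) {
            this.id = id;
            this.payload = new byte[payloadBytes];
        }
    }

    /** Hypothetical store cursor: builds one record per call to next(). */
    static Iterator<Record> openStore(final int count, final int payloadBytes) {
        return new Iterator<Record>() {
            private int nextId = 0;

            @Override
            public boolean hasNext() {
                return nextId < count;
            }

            @Override
            public Record next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return new Record(nextId++, payloadBytes);
            }
        };
    }

    /** Placeholder for whatever recovery work is done per record. */
    static void process(Record r) {
        // intentionally empty in this sketch
    }

    /**
     * Eager recovery: materialize the whole list before processing anything.
     * Returns the peak number of payload bytes held at once.
     */
    static long recoverEagerly(Iterator<Record> store) {
        List<Record> all = new ArrayList<>();
        long heldBytes = 0;
        while (store.hasNext()) {
            Record r = store.next();
            all.add(r); // every record stays reachable until the loop below ends
            heldBytes += r.payload.length;
        }
        for (Record r : all) {
            process(r);
        }
        return heldBytes;
    }

    /**
     * Lazy recovery: process each record as it is read, then let it go.
     * Returns the peak number of payload bytes held at once.
     */
    static long recoverLazily(Iterator<Record> store) {
        long peakBytes = 0;
        while (store.hasNext()) {
            Record r = store.next();
            process(r);
            peakBytes = Math.max(peakBytes, r.payload.length);
        }
        return peakBytes;
    }

    public static void main(String[] args) {
        System.out.println("eager peak bytes: " + recoverEagerly(openStore(1000, 1024)));
        System.out.println("lazy peak bytes: " + recoverLazily(openStore(1000, 1024)));
    }
}
```

      With 1000 records of 1024 bytes each, the eager variant holds 1024000 payload bytes before it processes the first record, while the lazy variant never holds more than 1024. The eager shape matches the symptom reported here: if the full list does not fit in the heap, processing never begins.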

        Attachments

        1. YARN-8242.008.patch
          78 kB
          Pradeep Ambati
        2. YARN-8242.007.patch
          84 kB
          Pradeep Ambati
        3. YARN-8242.006.patch
          75 kB
          Pradeep Ambati
        4. YARN-8242.005.patch
          74 kB
          Pradeep Ambati
        5. YARN-8242.004.patch
          52 kB
          Pradeep Ambati
        6. YARN-8242.003.patch
          28 kB
          Kanwaljeet Sachdev
        7. YARN-8242.002.patch
          26 kB
          Kanwaljeet Sachdev
        8. YARN-8242.001.patch
          24 kB
          Kanwaljeet Sachdev

              People

              • Assignee:
                Pradeep Ambati (pradeepambati)
              • Reporter:
                Kanwaljeet Sachdev (kanwaljeets)
              • Votes:
                0
              • Watchers:
                13
