  1. Hadoop YARN
  2. YARN-7176

Similar to YARN-2387: Resource Manager crashes with NPE due to lack of synchronization

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: RM
    • Labels: None

    Description

      Submit a job and, while the RM is launching the ApplicationMaster container (e.g. in startContainers), send a kill command. After the RM receives the kill command, it performs a state-store update (e.g. updateApplicationStateInternal).

      Both the startContainers path and updateApplicationStateInternal call the same method, ContainerLaunchContextPBImpl.getProto, which lacks synchronization (it can also be called from the reInitializeContainer method). When the two calls overlap, the RM fails and exits; the RM log shows the following:

      2017-09-08 02:34:37,967 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1504809243340_0001_000001. Got exception: java.lang.ArrayIndexOutOfBoundsException: 3
      	at java.util.ArrayList.add(ArrayList.java:441)
      	at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:330)
      	at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$Builder.addAllApplicationACLs(YarnProtos.java:39956)
      	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.addApplicationACLs(ContainerLaunchContextPBImpl.java:446)
      	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToBuilder(ContainerLaunchContextPBImpl.java:121)
      	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.mergeLocalToProto(ContainerLaunchContextPBImpl.java:128)
      	at org.apache.hadoop.yarn.api.records.impl.pb.ContainerLaunchContextPBImpl.getProto(ContainerLaunchContextPBImpl.java:70)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.convertToProtoFormat(StartContainerRequestPBImpl.java:156)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToBuilder(StartContainerRequestPBImpl.java:85)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.mergeLocalToProto(StartContainerRequestPBImpl.java:95)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainerRequestPBImpl.getProto(StartContainerRequestPBImpl.java:57)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.convertToProtoFormat(StartContainersRequestPBImpl.java:137)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.addLocalRequestsToProto(StartContainersRequestPBImpl.java:97)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToBuilder(StartContainersRequestPBImpl.java:79)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.mergeLocalToProto(StartContainersRequestPBImpl.java:72)
      	at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.StartContainersRequestPBImpl.getProto(StartContainersRequestPBImpl.java:48)
      	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:93)
      	at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
      	at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      
      2017-09-08 02:34:37,968 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating app: application_1504809243340_0001
      java.lang.NullPointerException
      	at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
      	at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
      	at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.getSerializedSize(YarnProtos.java:38512)
      	at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
      	at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
      	at org.apache.hadoop.yarn.proto.YarnProtos$ApplicationSubmissionContextProto.getSerializedSize(YarnProtos.java:28481)
      	at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
      	at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
      	at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationStateDataProto.getSerializedSize(YarnServerResourceManagerRecoveryProtos.java:816)
      	at com.google.protobuf.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationStateInternal(FileSystemRMStateStore.java:426)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:163)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:148)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:810)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:864)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:859)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      	at java.lang.Thread.run(Thread.java:745)
      2017-09-08 02:34:37,978 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
      java.lang.NullPointerException
      	at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
      	at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
      	at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.getSerializedSize(YarnProtos.java:38512)
      	at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
      	at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
      	at org.apache.hadoop.yarn.proto.YarnProtos$ApplicationSubmissionContextProto.getSerializedSize(YarnProtos.java:28481)
      	at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
      	at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
      	at org.apache.hadoop.yarn.proto.YarnServerResourceManagerRecoveryProtos$ApplicationStateDataProto.getSerializedSize(YarnServerResourceManagerRecoveryProtos.java:816)
      	at com.google.protobuf.AbstractMessageLite.toByteArray(AbstractMessageLite.java:62)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.updateApplicationStateInternal(FileSystemRMStateStore.java:426)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:163)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:148)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:810)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:864)
      	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:859)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
      	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
      	at java.lang.Thread.run(Thread.java:745)
      
      2017-09-08 02:34:37,987 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1504809243340_0001_01_000001 Container Transitioned from ACQUIRED to KILLED
      2017-09-08 02:34:37,987 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1504809243340_0001_01_000001 in state: KILLED event:KILL
      2017-09-08 02:34:37,987 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hires	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1504809243340_0001	CONTAINERID=container_1504809243340_0001_01_000001
      2017-09-08 02:34:37,988 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1504809243340_0001_01_000001 of capacity <memory:2048, vCores:1> on host hadoop11:45454, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8096, vCores:8> available, release resources=true
      2017-09-08 02:34:37,988 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
      2017-09-08 02:34:37,988 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:0, vCores:0> numContainers=0 user=hires user-resources=<memory:0, vCores:0>
      2017-09-08 02:34:37,989 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1504809243340_0001_01_000001, NodeId: hadoop11:45454, NodeHttpAddress: hadoop11:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.3.1.11:45454 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0 cluster=<memory:16192, vCores:16>
      2017-09-08 02:34:37,989 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0> cluster=<memory:16192, vCores:16>
      2017-09-08 02:34:37,990 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, numContainers=0
      2017-09-08 02:34:37,990 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1504809243340_0001_000001 released container container_1504809243340_0001_01_000001 on node: host: hadoop11:45454 #containers=0 available=8096 used=0 with event: KILL
      2017-09-08 02:34:37,990 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1504809243340_0001 requests cleared
      2017-09-08 02:34:37,990 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application removed - appId: application_1504809243340_0001 user: hires queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
      2017-09-08 02:34:38,001 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
      2017-09-08 02:34:38,005 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@hadoop11:8088
      2017-09-08 02:34:38,005 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
      2017-09-08 02:34:38,006 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
      2017-09-08 02:34:38,108 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032
      2017-09-08 02:34:38,113 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8032
      2017-09-08 02:34:38,113 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033
      2017-09-08 02:34:38,114 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
      2017-09-08 02:34:38,114 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8033
      2017-09-08 02:34:38,114 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
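
To illustrate the race described above, here is a minimal sketch, assuming a simplified stand-in for the PBImpl pattern. LaunchContextSketch, RaceDemo and countBadSnapshots are hypothetical names, not YARN classes, and a plain ArrayList stands in for the protobuf builder's repeated field. The sketch shows one possible remedy (making getProto synchronized); it is not necessarily what the attached patches do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a PBImpl-style record; NOT the real YARN class. */
final class LaunchContextSketch {
    // Stand-in for the builder's repeated field: a plain, non-thread-safe list.
    private final List<String> builderAcls = new ArrayList<>();
    // Locally cached values that are merged into the builder on every getProto().
    private final List<String> localAcls;

    LaunchContextSketch(List<String> acls) {
        this.localAcls = new ArrayList<>(acls);
    }

    // synchronized: only one thread at a time may clear and refill the builder.
    // Without it, two callers can interleave clear()/addAll() on the same list.
    public synchronized List<String> getProto() {
        builderAcls.clear();
        builderAcls.addAll(localAcls);
        return new ArrayList<>(builderAcls); // snapshot, analogous to build()
    }
}

final class RaceDemo {
    /** Calls getProto() from several threads; counts failed or wrong-sized snapshots. */
    static int countBadSnapshots(LaunchContextSketch ctx, int expectedSize,
                                 int threads, int callsPerThread) {
        AtomicInteger bad = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    try {
                        if (ctx.getProto().size() != expectedSize) {
                            bad.incrementAndGet();
                        }
                    } catch (RuntimeException e) {
                        // e.g. ArrayIndexOutOfBoundsException from an unsynchronized list
                        bad.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return bad.get();
    }

    public static void main(String[] args) {
        LaunchContextSketch ctx =
            new LaunchContextSketch(Arrays.asList("VIEW_APP", "MODIFY_APP"));
        System.out.println("bad snapshots: " + countBadSnapshots(ctx, 2, 4, 10_000));
    }
}
```

With getProto synchronized, the count stays at 0. If the synchronized keyword is removed from the sketch, concurrent callers may see wrong-sized snapshots or exceptions thrown from ArrayList, which is the kind of failure the first stack trace above shows.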
      

      Attachments

        1. YARN-7176_3.patch
          6 kB
          lujie
        2. YARN_7176_2.patch
          6 kB
          lujie
        3. YARN-7176.patch
          1 kB
          lujie
        4. logs.rar
          21 kB
          lujie

          People

            Assignee: xiaoheipangzi lujie
            Reporter: xiaoheipangzi lujie
            Votes: 1
            Watchers: 6
