Hadoop YARN / YARN-8222

Fix potential NPE when getting RMApp from RM context


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0, 3.1.1, 3.0.3, 2.10.2
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Recently we ran some performance tests and found two NPE problems when calling rmContext.getRMApps().get(appId).get...
      These NPEs occurred occasionally during performance tests with a large number of fast-finishing applications. We checked the other places that call rmContext.getRMApps().get(...): most of them have a null check, and some do not need one (the process guarantees that the returned result will not be null).
      To fix these problems, we can add a null check for the application before getting the attempt from it.

      (1) NPE in RMContainerImpl$FinishedTransition#updateAttemptMetrics

      java.lang.NullPointerException
              at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.updateAttemptMetrics(RMContainerImpl.java:742)
              at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:715)
              at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl$FinishedTransition.transition(RMContainerImpl.java:699)
              at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
              at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:482)
              at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:64)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.containerCompleted(FiCaSchedulerApp.java:195)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1793)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:2624)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:663)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:1514)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:2396)
              at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:205)
              at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:60)
              at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
              at java.lang.Thread.run(Thread.java:834)
      

      This NPE appears to happen when a node heartbeat is delayed and updateAttemptMetrics is then called for an app that no longer exists in rmContext.
      Reference code of RMContainerImpl$FinishedTransition#updateAttemptMetrics:

      private static void updateAttemptMetrics(RMContainerImpl container) {
            Resource resource = container.getContainer().getResource();
            RMAppAttempt rmAttempt = container.rmContext.getRMApps()
                .get(container.getApplicationAttemptId().getApplicationId())
                .getCurrentAppAttempt();
      
            if (rmAttempt != null) {
               //....
            }
      }
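
The guard proposed in the description can be sketched as follows. This is a minimal, self-contained illustration rather than the committed patch: `App`, `Attempt`, the plain `Map` standing in for rmContext.getRMApps(), and the `finishedContainers` counter are hypothetical stand-ins, not the real YARN types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for RMApp / RMAppAttempt; NOT the real YARN classes.
class NullSafeAttemptMetrics {

  /** Stand-in for RMAppAttempt: only counts finished containers. */
  static final class Attempt {
    int finishedContainers;
  }

  /** Stand-in for RMApp. */
  static final class App {
    private final Attempt currentAttempt;

    App(Attempt currentAttempt) {
      this.currentAttempt = currentAttempt;
    }

    Attempt getCurrentAppAttempt() {
      return currentAttempt;
    }
  }

  /**
   * Looks up the app first and only then asks for its current attempt.
   * Returns false when the app (or its attempt) is already gone, instead of
   * throwing a NullPointerException on the chained call.
   */
  static boolean updateAttemptMetrics(Map<String, App> rmApps, String appId) {
    App app = rmApps.get(appId);
    if (app == null) {
      return false; // app was already removed, e.g. it finished quickly
    }
    Attempt attempt = app.getCurrentAppAttempt();
    if (attempt == null) {
      return false;
    }
    attempt.finishedContainers++;
    return true;
  }

  public static void main(String[] args) {
    Map<String, App> rmApps = new ConcurrentHashMap<>();
    rmApps.put("app_1", new App(new Attempt()));
    System.out.println(updateAttemptMetrics(rmApps, "app_1")); // true

    // Simulate a late container-finished event for an app that is gone.
    rmApps.remove("app_1");
    System.out.println(updateAttemptMetrics(rmApps, "app_1")); // false
  }
}
```

Splitting the chained call into two steps is the whole point: the map lookup can legitimately return null once the application has been removed, so its result has to be checked before it is dereferenced.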
      

      (2) NPE in SchedulerApplicationAttempt#incNumAllocatedContainers

      java.lang.NullPointerException
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.incNumAllocatedContainers(SchedulerApplicationAttempt.java:1268)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.apply(FiCaSchedulerApp.java:638)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:3589)
              at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.tryCommit(SLSCapacityScheduler.java:142)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:962)
      

      This NPE should happen when an outdated proposal is applied for an application that no longer exists in rmContext.
      Reference code:

          RMAppAttempt attempt =
              rmContext.getRMApps().get(attemptId.getApplicationId())
                .getCurrentAppAttempt();
          if (attempt != null) {
            attempt.getRMAppAttemptMetrics().incNumAllocatedContainers(containerType,
              requestType);
          }
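
One way to make this lookup null-safe is shown below. It is only a sketch under stated assumptions: it uses java.util.Optional rather than the explicit null check the description proposes, and `App`, `Attempt`, `AttemptMetrics`, and the simplified no-argument `incNumAllocatedContainers` are hypothetical stand-ins for the real YARN types and signatures.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the YARN types; NOT the real classes.
class NullSafeAllocationCounter {

  /** Stand-in for RMAppAttemptMetrics (container/request types omitted). */
  static final class AttemptMetrics {
    private int numAllocatedContainers;

    void incNumAllocatedContainers() {
      numAllocatedContainers++;
    }

    int getNumAllocatedContainers() {
      return numAllocatedContainers;
    }
  }

  /** Stand-in for RMAppAttempt. */
  static final class Attempt {
    private final AttemptMetrics metrics = new AttemptMetrics();

    AttemptMetrics getRMAppAttemptMetrics() {
      return metrics;
    }
  }

  /** Stand-in for RMApp. */
  static final class App {
    private final Attempt current;

    App(Attempt current) {
      this.current = current;
    }

    Attempt getCurrentAppAttempt() {
      return current;
    }
  }

  /**
   * Increments the counter only if both the app and its current attempt
   * exist. Optional.map yields an empty Optional when a step returns null,
   * so a missing app or attempt makes this a no-op instead of an NPE.
   */
  static void incNumAllocatedContainers(Map<String, App> rmApps, String appId) {
    Optional.ofNullable(rmApps.get(appId))
        .map(App::getCurrentAppAttempt)
        .map(Attempt::getRMAppAttemptMetrics)
        .ifPresent(AttemptMetrics::incNumAllocatedContainers);
  }

  public static void main(String[] args) {
    Map<String, App> rmApps = new ConcurrentHashMap<>();
    Attempt attempt = new Attempt();
    rmApps.put("app_1", new App(attempt));

    incNumAllocatedContainers(rmApps, "app_1");
    System.out.println(attempt.getRMAppAttemptMetrics().getNumAllocatedContainers()); // 1

    // An outdated proposal for an app that is no longer tracked is ignored.
    incNumAllocatedContainers(rmApps, "app_gone");
    System.out.println(attempt.getRMAppAttemptMetrics().getNumAllocatedContainers()); // 1
  }
}
```

An explicit `if (app == null)` check behaves the same way; the Optional chain is just a compact alternative when several nullable steps are chained.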
      

          People

            Assignee: Tao Yang
            Reporter: Tao Yang
            Votes: 0
            Watchers: 5
