Hadoop YARN / YARN-2387

Resource Manager crashes with NPE due to lack of synchronization


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.5.0, 3.0.0-alpha1
    • Fix Version/s: 2.6.0
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it.

      2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
      java.lang.NullPointerException
              at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61)
              at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68)
              at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53)
              at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34)
              at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55)
              at java.lang.String.valueOf(String.java:2854)
              at java.lang.StringBuilder.append(StringBuilder.java:128)
              at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353)
              at java.lang.String.valueOf(String.java:2854)
              at java.lang.StringBuilder.append(StringBuilder.java:128)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339)
              at java.lang.Thread.run(Thread.java:722)
      2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
      

      On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.X code follows the same pattern.

      We need to make these methods synchronized so that we do not hit this problem again.
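
      To illustrate the kind of change proposed, here is a minimal sketch. It is not the actual patch. StatusProto and StatusRecord are hypothetical stand-ins with no protobuf dependency, loosely modeled on a record that keeps a built message, a builder, and a viaProto flag. The assumed failure mode is that, without synchronization, one thread can null out the builder while another thread is still using it; marking every method that touches that shared state as synchronized rules this interleaving out.

```java
/** Hypothetical stand-in for an immutable protobuf message. */
final class StatusProto {
    final String diagnostics;

    StatusProto(String diagnostics) {
        this.diagnostics = diagnostics;
    }
}

/** Hypothetical record that switches between a built message and a builder. */
class StatusRecord {
    private StatusProto proto = new StatusProto("");
    private StringBuilder builder = null; // stands in for a protobuf Builder
    private boolean viaProto = true;      // true while 'proto' is authoritative

    // Builds the message from the builder if needed, then discards the builder.
    // Unsynchronized, this could null the builder under a concurrent setter.
    public synchronized StatusProto getProto() {
        if (!viaProto) {
            proto = new StatusProto(builder.toString());
            builder = null;
            viaProto = true;
        }
        return proto;
    }

    public synchronized void setDiagnostics(String text) {
        maybeInitBuilder();
        builder.setLength(0);
        builder.append(text);
    }

    public synchronized String getDiagnostics() {
        return viaProto ? proto.diagnostics : builder.toString();
    }

    // Only called while holding the lock on 'this'.
    private void maybeInitBuilder() {
        if (viaProto || builder == null) {
            builder = new StringBuilder(proto.diagnostics);
        }
        viaProto = false;
    }

    @Override
    public synchronized String toString() {
        return getProto().diagnostics;
    }
}

class SynchronizedRecordDemo {
    public static void main(String[] args) throws InterruptedException {
        StatusRecord record = new StatusRecord();
        // One thread mutates the record while another logs it via toString().
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) {
                record.setDiagnostics("update " + i);
            }
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 100_000; i++) {
                record.toString();
            }
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
        System.out.println("last value: " + record.getDiagnostics());
    }
}
```

      In this sketch, toString() takes the same lock as the setter, so a log statement on one thread cannot observe the record halfway through an update made by another thread.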

        Attachments

        1. YARN-2387.patch
          1 kB
          Mit Desai
        2. YARN-2387.patch
          3 kB
          Mit Desai
        3. YARN-2387.patch
          3 kB
          Mit Desai


            People

            • Assignee:
              mitdesai Mit Desai
              Reporter:
              mitdesai Mit Desai
