  1. Apache Tez
  2. TEZ-4440

Tez app may throw NPE when running in a YARN federation cluster

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.3, 0.10.3
    • Component/s: None
    • Labels: None

    Description

      This affects Hadoop versions that do not include YARN-8933. When a Tez app runs in a YARN federation cluster, getAvailableResources may return null, and the Tez scheduler then throws an NPE.

      2022-08-03 01:40:12,069 [ERROR] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Got Error from RMClient
      java.lang.NullPointerException
          at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.fitsIn(YarnTaskSchedulerService.java:1445)
          at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.preemptIfNeeded(YarnTaskSchedulerService.java:1218)
          at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:916)
          at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:428)
      2022-08-03 01:40:12,075 [ERROR] [AMRM Callback Handler Thread] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[AMRM Callback Handler Thread,5,main] threw an Exception.
      org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.NullPointerException
          at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:432)
      Caused by: java.lang.NullPointerException
          at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.fitsIn(YarnTaskSchedulerService.java:1445)
          at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.preemptIfNeeded(YarnTaskSchedulerService.java:1218)
          at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:916)
          at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:428)

      In YARN federation, AMRMProxy connects to multiple RMs asynchronously, so AllocateResponse::getAvailableResources may return null, which leads to the NPE above.

      In my PR, I replace null with Resource.newInstance(0, 0). A null value may mean that YARN is busy, so returning zero available resources is reasonable.
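The idea can be illustrated with a minimal, self-contained sketch. It is not the actual patch: the `Resource` class here is a hypothetical stand-in for `org.apache.hadoop.yarn.api.records.Resource`, `orZero` is a hypothetical helper name, and `fitsIn` is assumed to be a simple memory comparison that dereferences both arguments.

```java
public class AvailableResourcesGuard {

    /** Hypothetical stand-in for org.apache.hadoop.yarn.api.records.Resource. */
    static final class Resource {
        final int memory;
        final int vCores;

        Resource(int memory, int vCores) {
            this.memory = memory;
            this.vCores = vCores;
        }

        static Resource newInstance(int memory, int vCores) {
            return new Resource(memory, vCores);
        }
    }

    /** Hypothetical helper: treat a missing headroom report as "nothing available right now". */
    static Resource orZero(Resource reported) {
        return reported != null ? reported : Resource.newInstance(0, 0);
    }

    /** Assumed memory-only fit check; it dereferences both arguments, so neither may be null. */
    static boolean fitsIn(Resource toFit, Resource available) {
        return available.memory >= toFit.memory;
    }

    public static void main(String[] args) {
        Resource request = Resource.newInstance(1024, 1);

        // A federated allocate response may carry no headroom at all.
        Resource reported = null;
        System.out.println(fitsIn(request, orZero(reported))); // prints false

        // A normal response is passed through unchanged.
        System.out.println(fitsIn(request, orZero(Resource.newInstance(4096, 4)))); // prints true
    }
}
```

Without the `orZero` step, the first `fitsIn` call would dereference null and fail in the same way as the stack trace above. With it, the scheduler simply sees no free headroom for that heartbeat.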

       

       

            People

              Assignee: Chenyu Zheng (zhengchenyu)
              Reporter: Chenyu Zheng (zhengchenyu)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h