Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-6661

Too much CLEANUP event hang ApplicationMasterLauncher thread pool

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.2
    • Fix Version/s: None
    • Component/s: fairscheduler
    • Labels:
      None
    • Environment:

      hadoop 2.7.2

    • Target Version/s:

      Description

      Some one else have already come up with the similar problem and fix it.
      We can look the jira(https://issues.apache.org/jira/browse/YARN-3809) for detail.
      But I think the fix have not solve the problem completely, blow was the problem I encountered:
      There is about 1000 nodes in my hadoop cluster, and I submit about 1800 apps.
      I failover my active rm and rm will failover all those 1800 apps.
      When a application failover, It will wait for AM container register itself.
      But there is a bug in my AM (I do it intentionally), and it will not register itself.
      So the RM will wait for about 10mins for the AM expiration, and it will send a CLEANUP event to
      ApplicationMasterLauncher thread pool. Because there is about 1800 apps, so it will hang the ApplicationMasterLauncher
      thread pool for a large time. I have already use the patch(https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so
      a CLEANUP event will hang a thread 10 * 20 = 200s. But I have 1800 apps, so for each of my thread, it will
      hang 1800 / 50 * 200s = 7200s=20min.
      Because the AM have register itself during 10mins´╝î so it will retry and create a new application attempt.
      The application attempt will accept a container from RM, and send a LAUNCH to ApplicationMasterLauncher thread pool.
      Because the 1800 CLEANUP will hang the 50 thread pools about 20mins. So the application attempt will not
      start the AM container during 10min.
      And it will expire, and send a CLEANUP event to ApplicationMasterLauncher thread pools too.
      As you can see, none of my application can really run it.
      Each of them have 5 application attempts as follows, and each of them keep retrying.
      appattempt_1495786030132_4000_000005
      appattempt_1495786030132_4000_000004
      appattempt_1495786030132_4000_000003
      appattempt_1495786030132_4000_000002
      appattempt_1495786030132_4000_000001
      So all of my apps have hang several hours, and none of them can really run.
      I think this is a bug!!! We can treat CLEANUP and LAUNCH as different events.
      And use some other thread to deal with LAUNCH event or use other way.
      Sorry, I english is so poor. I don't know have I describe it clearly.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              zhouyunfan JackZhou
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: