Hadoop YARN / YARN-4536

DelayedProcessKiller may not work under heavy workload

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

    Description

      I am seeing orphaned container processes. Here is the scenario:
      Under heavy task load, CPU usage on the NM machine can reach almost 100%. When a container receives a kill event, it is sent SIGTERM; the parent process then exits, leaving the container process to the OS. The container process still needs to handle shutdown events or other logic, but it can hardly get any CPU time. We would expect a SIGKILL to follow, since there is a DelayedProcessKiller, but the parent process, whose pid was persisted as the container pid, no longer exists, so the kill command cannot reach the container process. This is how the orphaned container process arises.
      The orphaned process does exit eventually, but this can take a very long time, and in the meantime it makes the state of the OS worse. In my observation, it can take several hours.
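
The scenario above can be reproduced outside YARN with a small sketch. This is not NodeManager code: `launch_container_like`, `term_then_kill`, and `all_writers_gone` are hypothetical helpers, the 30-second `sleep` stands in for a container process that does not react to SIGTERM, and a POSIX `sh` is assumed. It contrasts a delayed SIGKILL aimed at the recorded pid with one aimed at the whole process group.

```python
import os
import select
import signal
import subprocess
import time


def launch_container_like():
    """Start a launcher shell (own session) whose child ignores SIGTERM.

    Hypothetical stand-in for a container launch; returns (launcher, child_pid).
    """
    launcher = subprocess.Popen(
        ["sh", "-c", "sh -c 'trap \"\" TERM; echo $$; exec sleep 30' & wait"],
        stdout=subprocess.PIPE,
        start_new_session=True,  # launcher leads a new process group
    )
    # The child prints its pid only after installing the trap.
    child_pid = int(launcher.stdout.readline().decode().strip())
    return launcher, child_pid


def term_then_kill(pid, grace_seconds, whole_group):
    """Send SIGTERM, wait grace_seconds, then send SIGKILL.

    With whole_group=False only the recorded pid is signalled; with
    whole_group=True every process in the group led by pid is signalled.
    """
    send = os.killpg if whole_group else os.kill
    send(pid, signal.SIGTERM)
    time.sleep(grace_seconds)
    try:
        send(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # nothing left to signal under this pid / group id


def all_writers_gone(pipe, timeout):
    """True if EOF is seen on the pipe, i.e. no process still holds its write end."""
    ready, _, _ = select.select([pipe], [], [], timeout)
    return bool(ready) and os.read(pipe.fileno(), 1) == b""


if __name__ == "__main__":
    for whole_group in (False, True):
        launcher, child_pid = launch_container_like()
        term_then_kill(launcher.pid, 0.2, whole_group)
        launcher.wait()
        survived = not all_writers_gone(launcher.stdout, 0.5)
        print(f"whole_group={whole_group}: child {child_pid} survived={survived}")
        if survived:
            os.killpg(launcher.pid, signal.SIGKILL)  # clean up the orphan
```

In the pid-only run the launcher dies on SIGTERM and the later SIGKILL has nothing useful to hit, so the child keeps running as an orphan. In the group run the SIGKILL is delivered to every member of the group, including a child that ignored SIGTERM. Whether a given YARN deployment signals a single pid or a process group is not stated in this report, so the sketch only illustrates the difference between the two.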

          People

            Assignee: Unassigned
            Reporter: gu chi (gu-chi)
            Votes: 0
            Watchers: 6
