Hadoop YARN / YARN-4325

Nodemanager log handlers fail to send finished/failed events in some cases

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      On a long-running cluster, we found tens of thousands of stale apps still being recovered during NM restart recovery.
      After investigation, we found three issues that cause app state to leak in the NM state store:
      1. APPLICATION_LOG_HANDLING_FAILED is not handled by removing the app from the NMStateStore.
      2. The APPLICATION_LOG_HANDLING_FAILED event is never sent when the aggregator's doAppLogAggregation() hits an exception.
      3. Only an Application in FINISHED status that receives APPLICATION_LOG_FINISHED has a transition that removes the app from the NM state store. An Application in another status, such as APPLICATION_RESOURCES_CLEANUP, ignores the event, so the app is never removed from the NM state store even after it finishes.
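
      The three issues above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code from the attached patches: the classes AppStateLeakSketch, App, and InMemoryStateStore, the helper runLogAggregation(), and the events FINISH_APPLICATION and RESOURCES_CLEANED_UP are hypothetical names invented here, and the other state and event names simply follow the wording of this description. It shows one possible way to avoid the leak: record the log-handling outcome in every state, treat failure like success for cleanup purposes, and always emit a terminal event when aggregation throws.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model; NOT the real ApplicationImpl or NM state store.
class AppStateLeakSketch {

    enum AppState { RUNNING, APPLICATION_RESOURCES_CLEANUP, FINISHED }

    // FINISH_APPLICATION and RESOURCES_CLEANED_UP are invented for this sketch.
    enum AppEvent {
        FINISH_APPLICATION,
        RESOURCES_CLEANED_UP,
        APPLICATION_LOG_FINISHED,
        APPLICATION_LOG_HANDLING_FAILED
    }

    // Stand-in for the NM state store: just a set of application ids.
    static final class InMemoryStateStore {
        private final Set<String> apps = new HashSet<>();
        void storeApplication(String appId) { apps.add(appId); }
        void removeApplication(String appId) { apps.remove(appId); }
        boolean contains(String appId) { return apps.contains(appId); }
    }

    static final class App {
        private final String appId;
        private final InMemoryStateStore store;
        private AppState state = AppState.RUNNING;
        private boolean logHandlingDone = false;

        App(String appId, InMemoryStateStore store) {
            this.appId = appId;
            this.store = store;
            store.storeApplication(appId);
        }

        void handle(AppEvent event) {
            switch (event) {
                case FINISH_APPLICATION:
                    if (state == AppState.RUNNING) {
                        state = AppState.APPLICATION_RESOURCES_CLEANUP;
                    }
                    break;
                case RESOURCES_CLEANED_UP:
                    if (state == AppState.APPLICATION_RESOURCES_CLEANUP) {
                        state = AppState.FINISHED;
                    }
                    break;
                case APPLICATION_LOG_FINISHED:
                case APPLICATION_LOG_HANDLING_FAILED:
                    // Issues 1 and 3: remember the outcome in every state, and
                    // treat a failure like a success as far as cleanup goes.
                    logHandlingDone = true;
                    break;
            }
            // Remove the app once it is finished AND log handling has ended,
            // regardless of which of the two happened first.
            if (state == AppState.FINISHED && logHandlingDone) {
                store.removeApplication(appId);
            }
        }
    }

    // Issue 2: always send a terminal log-handling event, even on an exception.
    static void runLogAggregation(App app, Runnable doAppLogAggregation) {
        AppEvent outcome;
        try {
            doAppLogAggregation.run();
            outcome = AppEvent.APPLICATION_LOG_FINISHED;
        } catch (RuntimeException e) {
            outcome = AppEvent.APPLICATION_LOG_HANDLING_FAILED;
        }
        app.handle(outcome);
    }

    public static void main(String[] args) {
        InMemoryStateStore store = new InMemoryStateStore();
        App app = new App("app_1", store);
        app.handle(AppEvent.FINISH_APPLICATION);
        // Aggregation fails while the app is still cleaning up resources.
        runLogAggregation(app, () -> { throw new IllegalStateException("aggregation failed"); });
        app.handle(AppEvent.RESOURCES_CLEANED_UP);
        System.out.println("still in store: " + store.contains("app_1")); // prints "still in store: false"
    }
}
```

      In this model the removal check runs after every event, so the order in which the app finishes and log handling ends does not matter. The real NodeManager state machine may be structured differently.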

        Attachments

        1. YARN-4325-v4.1.patch
          15 kB
          Junping Du
        2. YARN-4325-v4.patch
          15 kB
          Junping Du
        3. YARN-4325-v3.1.patch
          14 kB
          Junping Du
        4. YARN-4325-v3.patch
          14 kB
          Junping Du
        5. YARN-4325-v2.patch
          9 kB
          Junping Du
        6. YARN-4325-v1.1.patch
          12 kB
          Junping Du
        7. YARN-4325-v1.patch
          12 kB
          Junping Du
        8. YARN-4325.patch
          7 kB
          Junping Du
        9. ApplicationImpl.PNG
          82 kB
          Junping Du

              People

              • Assignee: Junping Du (djp)
              • Reporter: Junping Du (djp)
              • Votes: 0
              • Watchers: 15