Details

    • Sub-task
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 3.4.0
    • 3.4.0
    • resourcemanager
    • None
    • Reviewed

    Description

      In the current implementation, NodesListManager event handling blocks the async dispatcher, which can slow down event processing and crash the RM.

      1. Cluster restart with 1K running apps: each node-usable event fans out into 1K events (one per running app), so a 5K-node cluster can generate 5K * 1K events overall.
      2. Event processing is blocked until the new events have been added to the queue.
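
      To make the scale in item 1 concrete, here is a minimal sketch of the fan-out arithmetic. The class and method names are hypothetical and do not come from the YARN code base; the 5K and 1K figures are the ones quoted above.

```java
// Hypothetical helper, for illustration only (not part of YARN).
final class FanOut {
    // One event per (node, running app) pair; multiplyExact guards against overflow.
    static long eventsGenerated(long nodes, long runningApps) {
        return Math.multiplyExact(nodes, runningApps);
    }
}
```

      With 5,000 nodes and 1,000 running apps, eventsGenerated(5000, 1000) returns 5,000,000 events that would all pass through the single dispatcher queue.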

      Solution:

      1. Add another async event handler, similar to the one used for the scheduler.
      2. Instead of adding events to the dispatcher, call the RMApp event handler directly.
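
      The two points above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from the attached patches: DedicatedEventHandler is a hypothetical name, and the target callback stands in for whatever handler (for example the RMApp event handler) should receive the events. The idea is that the producer enqueues onto a private queue drained by its own thread, so the fan-out never occupies the shared dispatcher.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of a dedicated async event handler (not YARN's actual class).
final class DedicatedEventHandler<E> implements AutoCloseable {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Thread worker;
    private volatile boolean stopped = false;

    // target is invoked directly on the worker thread, bypassing any shared dispatcher.
    DedicatedEventHandler(String name, Consumer<E> target) {
        worker = new Thread(() -> {
            try {
                // Keep draining until close() was called and the queue is empty.
                while (!stopped || !queue.isEmpty()) {
                    E event = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (event != null) {
                        target.accept(event);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        worker.setDaemon(true);
        worker.start();
    }

    // Called by the producer; enqueues and returns immediately.
    void handle(E event) {
        queue.add(event);
    }

    // Stops accepting work after the queue drains and waits for the worker to exit.
    @Override
    public void close() {
        stopped = true;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

      A caller would construct the handler once with the app-level callback, pass each node update to handle(...), and call close() on shutdown. The unbounded queue is a simplification; a real implementation would also need to decide how to handle exceptions thrown by the target.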

      Attachments

        1. YARN-9618.001.patch
          5 kB
          Qi Zhu
        2. YARN-9618.002.patch
          13 kB
          Qi Zhu
        3. YARN-9618.003.patch
          13 kB
          Qi Zhu
        4. YARN-9618.004.patch
          17 kB
          Qi Zhu
        5. YARN-9618.005.patch
          17 kB
          Qi Zhu
        6. YARN-9618.006.patch
          7 kB
          Qi Zhu
        7. YARN-9618.007.patch
          7 kB
          Qi Zhu

            People

              zhuqi Qi Zhu
              bibinchundatt Bibin Chundatt
              Votes: 1
              Watchers: 8