  Hadoop YARN / YARN-270

RM scheduler event handler thread gets behind

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.23.5
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

      Description

      We had a couple of incidents on a 2800-node cluster where the RM scheduler event handler thread got behind processing events and basically became unusable. It was still processing apps, but taking a long time (1 hr 45 minutes) to accept new apps. This actually happened twice within 5 days.

      We are using the capacity scheduler, and at the time we had between 400 and 500 applications running. There were another 250 apps in the SUBMITTED state in the RM that the scheduler hadn't yet processed to put into the pending state. We had about 15 queues, none of them hierarchical. We also had plenty of space left on the cluster.

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          tgraves Thomas Graves added a comment -

          I was able to reproduce this on a smaller cluster by simulating 2800 nodes - we had 720 node managers and set the heartbeat interval to 250 ms (instead of 1 s). At about 400 applications the scheduler queue starts to grow. I am still in the process of investigating what exactly is taking the time.

          Note that for now we believe we have worked around it by increasing the nodemanager heartbeat interval from 1 second to 3 seconds.
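
          For reference, a minimal sketch of that workaround in code form. It assumes the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms property (the name used by YarnConfiguration in later releases) is the knob being changed for this deployment; in practice the value would be set in yarn-site.xml on the RM rather than programmatically.

{code:java}
// Sketch of the workaround described above: raise the NM heartbeat interval
// from the 1 s default to 3 s. The property name is an assumption for this
// release; operationally it belongs in yarn-site.xml on the ResourceManager.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HeartbeatIntervalWorkaround {
  public static Configuration threeSecondHeartbeat() {
    Configuration conf = new YarnConfiguration();
    conf.setLong("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", 3000L);
    return conf;
  }
}
{code}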

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Thanks for filing this, Thomas. IIRC, the event-handler's upper limit is about 0.6 million events; somehow we only focused on the number of nodes and never thought about the scaling issue with a large number of applications. There are multiple solutions for this, in order of importance (the dispatcher pattern in question is sketched after this list):

          • Make NodeManagers NOT blindly heartbeat irrespective of whether the previous heartbeat is processed or not.
          • Figure out any obvious bottlenecks in the scheduling code.
          • When all else fails, try to parallelize the scheduler dispatcher.
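
          For readers less familiar with the component being discussed, below is a minimal sketch (not the actual org.apache.hadoop.yarn.event.AsyncDispatcher source) of the single-consumer, unbounded-queue dispatcher pattern in question: producers enqueue without limit and a single thread drains, so the backlog grows whenever events arrive faster than the handler can process them. The three items above attack, respectively, the producers, the per-event cost, and the single-threadedness.

{code:java}
// Illustrative sketch only: a single-consumer dispatcher with an unbounded
// queue. Any sustained imbalance between the event arrival rate and the
// handler's processing rate makes the queue grow without bound.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class SingleThreadDispatcher<E> {
  private final BlockingQueue<E> queue = new LinkedBlockingQueue<>(); // unbounded
  private final Thread worker;

  SingleThreadDispatcher(Consumer<E> handler) {
    this.worker = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          handler.accept(queue.take()); // one event at a time, in arrival order
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // stop draining on interrupt
      }
    }, "event-handler");
    this.worker.setDaemon(true);
    this.worker.start();
  }

  void dispatch(E event) {
    queue.add(event); // never blocks the producer, so the backlog can pile up
  }
}
{code}
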
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Make NodeManagers NOT blindly heartbeat irrespective of whether the previous heartbeat is processed or not.

          Filed YARN-275

          nroberts Nathan Roberts added a comment -

          Could we also add some additional flow control within the RM to prevent this work from getting into the event queues in the first place? Having the clients throttle on their end is important in the short term but in the long run we need a flow control strategy that can exert back pressure at all stages of the pipeline.
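
          One hedged reading of this suggestion, assuming event producers could tolerate blocking: bound the queue so that enqueueing itself slows producers down once the handler falls behind. A purely illustrative sketch, not a proposed patch to the dispatcher:

{code:java}
// Illustrative only: the simplest form of back pressure is a bounded queue
// whose put() blocks the producer once the dispatcher is saturated. Blocking
// RPC handler threads this way has its own risks, hence the discussion below.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedDispatchQueue<E> {
  private final BlockingQueue<E> queue;

  BoundedDispatchQueue(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  void dispatch(E event) throws InterruptedException {
    queue.put(event); // blocks the caller once the backlog reaches capacity
  }

  E next() throws InterruptedException {
    return queue.take(); // drained by the single handler thread
  }
}
{code}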

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Nathan, unfortunately, the dispatcher framework cannot exert back pressure in general; each event producer needs to control itself.

          OTOH, YARN-275 is indeed a long-term fix: NMs back off just like the TTs do in 1.*.

          revans2 Robert Joseph Evans added a comment -

          It cannot exert back pressure currently, but I don't see any reason to think that it could not be added in the future. Something as simple as setting a high-water mark on the number of pending events and throttling events from incoming connections until the congestion subsides might be enough.

          We have seen a similar issue in the IPC layer on the AM when too many reducers were trying to download the mapper locations. Granted, this is not the same code, but it was caused by asynchronously handling events and buffering up the data, so when we got behind we eventually got OOMs. I think we will continue to see more issues as we scale up until we solve it generally; otherwise every single client API call will eventually have to be updated to avoid overloading the system.
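
          A hedged sketch of the high-water-mark idea from the first paragraph, with hypothetical names throughout: a small gate in front of the event queue stops admitting client-facing work once the backlog crosses a high threshold and reopens only after it drains below a low threshold, so internal events keep flowing while external load is shed.

{code:java}
// Hypothetical illustration of high/low water mark throttling; not an actual
// RM change. Client-facing producers call tryAdmit() before enqueueing, and
// the dispatcher thread calls onHandled() after each event is processed.
import java.util.concurrent.atomic.AtomicInteger;

class WaterMarkThrottle {
  private final int highWaterMark;
  private final int lowWaterMark;
  private final AtomicInteger pending = new AtomicInteger();
  private volatile boolean throttling = false;

  WaterMarkThrottle(int highWaterMark, int lowWaterMark) {
    this.highWaterMark = highWaterMark;
    this.lowWaterMark = lowWaterMark;
  }

  /** Returns false if the request should be deferred or rejected because the
   *  dispatcher is congested. */
  boolean tryAdmit() {
    int size = pending.get();
    if (throttling) {
      if (size > lowWaterMark) {
        return false;            // still congested, keep shedding load
      }
      throttling = false;        // drained below the low mark, reopen the gate
    } else if (size > highWaterMark) {
      throttling = true;         // congestion detected, start shedding load
      return false;
    }
    pending.incrementAndGet();
    return true;
  }

  /** Called by the dispatcher thread after an event has been handled. */
  void onHandled() {
    pending.decrementAndGet();
  }
}
{code}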

          vinodkv Vinod Kumar Vavilapalli added a comment -

          I don't see it yet, but a comprehensive proposal can clarify what you have in mind, I suppose. Please file a sub-ticket and propose what you think. Tx.

          xgong Xuan Gong added a comment -

          Breaking the sub-task YARN-275 ("Make NodeManagers NOT blindly heartbeat irrespective of whether the previous heartbeat is processed or not") into smaller tasks:
          1. Make the RM provide the heartbeat interval to the NM (a sketch follows this list).
          2. RM changes to handle NM heartbeats during overload.
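
          A hedged sketch of item 1, using hypothetical type and method names rather than the real YARN protocol records: the RM returns the heartbeat interval it currently wants in every heartbeat response, and the NM honors it on the next round, which would let the RM stretch the interval while it is overloaded.

{code:java}
// Hypothetical illustration of item 1 above; class, interface and method
// names are made up for the sketch and are not the actual YARN records.
class HeartbeatResponse {
  private final long nextHeartbeatIntervalMs;

  HeartbeatResponse(long nextHeartbeatIntervalMs) {
    this.nextHeartbeatIntervalMs = nextHeartbeatIntervalMs;
  }

  long getNextHeartbeatIntervalMs() {
    return nextHeartbeatIntervalMs;
  }
}

interface HeartbeatService {
  // In the real protocol the request carries full node status; simplified here.
  HeartbeatResponse nodeHeartbeat(String nodeId);
}

class NodeHeartbeatLoop implements Runnable {
  private static final long DEFAULT_INTERVAL_MS = 1000L;
  private final HeartbeatService rm;
  private final String nodeId;

  NodeHeartbeatLoop(HeartbeatService rm, String nodeId) {
    this.rm = rm;
    this.nodeId = nodeId;
  }

  @Override
  public void run() {
    long interval = DEFAULT_INTERVAL_MS;
    try {
      while (!Thread.currentThread().isInterrupted()) {
        HeartbeatResponse response = rm.nodeHeartbeat(nodeId);
        // Honor whatever interval the RM asked for on the next round.
        interval = Math.max(DEFAULT_INTERVAL_MS, response.getNextHeartbeatIntervalMs());
        Thread.sleep(interval);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}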

          sharadag Sharad Agarwal added a comment -

          When all else fails, try to parallelize the scheduler dispatcher

          Long term, I think this should be the solution. We need ordering of events only for a given event type, so this should be doable and will give the next level of scalability for both the AM and the RM.
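
          A minimal sketch of that idea with made-up names (this is not the actual YARN dispatcher): route each event type to its own single-threaded executor, so events of one type stay ordered while different types are handled in parallel. The per-type executors could also be backed by a shared, bounded pool; a map of single-threaded executors is just the simplest way to show the ordering property.

{code:java}
// Illustrative only: ordering is preserved within an event type because each
// type is pinned to one single-threaded executor, while different types can
// be processed concurrently.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

class PerTypeDispatcher<T, E> {
  private final ConcurrentMap<T, ExecutorService> executors = new ConcurrentHashMap<>();
  private final Consumer<E> handler;

  PerTypeDispatcher(Consumer<E> handler) {
    this.handler = handler;
  }

  void dispatch(T eventType, E event) {
    executors
        .computeIfAbsent(eventType, t -> Executors.newSingleThreadExecutor())
        .execute(() -> handler.accept(event));
  }

  void shutdown() {
    executors.values().forEach(ExecutorService::shutdown);
  }
}
{code}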

          revans2 Robert Joseph Evans added a comment -

          I agree that part of the fix needs to be making the scheduler parallel, but we also need a general way to apply back pressure; otherwise there will always be a way to accidentally bring down the system with a DOS. We recently saw what appears to be a very similar issue show up on an MRAppMaster. We still don't understand exactly what triggered it, but a job that would typically take 5 to 10 mins to complete was still running 17 hours later because the queue filled up, which caused the JVM to start garbage collecting like crazy, which in turn made it unable to process all of the events coming in, which made the queue fill up even more. We plan to address this in the short term by making the JVM OOM much sooner than the default, but that is still just a band-aid on the underlying problem: unless there is back pressure, there is always the possibility for incoming requests to overwhelm the system.

          tgraves Thomas Graves added a comment -

          Changing this to not be a blocker since we have worked around it and some of the subtasks are complete.


            People

            • Assignee: tgraves Thomas Graves
            • Reporter: tgraves Thomas Graves
            • Votes: 0
            • Watchers: 21
