Hadoop YARN
  YARN-110

AM releases too many containers due to the protocol

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager, scheduler
    • Labels: None

      Description

      • AM sends a request asking for 4 containers on host H1.
      • Asynchronously, host H1 heartbeats to the RM and gets assigned 4 containers. At this point the RM sets the value
        against H1 to zero in its aggregate request-table for all apps.
      • In the meanwhile the AM finds it needs 3 more containers, for a total of 7 including the 4 from the previous request.
      • Today, the AM sends the absolute number of 7 against H1 to the RM as part of its request table.
      • The RM seems to override its earlier value of zero against H1 with 7, and thus allocates 7 more containers.
      • The AM already got 4 in this scheduling iteration, but now gets 7 more, a total of 11 instead of the required 7
        (see the sketch after this list).
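      The following is a minimal, self-contained sketch of that race, with a plain map standing in for the RM's
      aggregate request-table (the class and variable names are illustrative, not the actual RM code):

        import java.util.HashMap;
        import java.util.Map;

        // Toy model of the race: the AM reports absolute outstanding counts per host,
        // while the RM zeroes its table whenever it hands containers out.
        public class AbsoluteAskRace {
            public static void main(String[] args) {
                Map<String, Integer> rmTable = new HashMap<>();   // RM's aggregate request-table
                int allocatedToAm = 0;

                // 1. AM asks for 4 containers on H1.
                rmTable.put("H1", 4);

                // 2. H1 heartbeats; the RM assigns 4 containers and zeroes its entry for H1.
                allocatedToAm += rmTable.get("H1");
                rmTable.put("H1", 0);

                // 3. Meanwhile the AM decides it needs 3 more and, not yet having seen the
                //    4 allocations, sends the absolute total of 7 against H1.
                rmTable.put("H1", 7);                             // overrides the RM's zero

                // 4. The RM satisfies the table as it now stands: 7 more containers.
                allocatedToAm += rmTable.get("H1");
                rmTable.put("H1", 0);

                System.out.println("AM wanted 7, got " + allocatedToAm);   // prints 11
            }
        }
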
      Attachments

      • YARN-110.patch (6 kB) - Arun C Murthy

          Activity

          Giovanni Matteo Fumarola added a comment -

          Karthik Kambatla My idea is to keep a list of requested containers in AppSchedulingInfo.
          When the RM sends containers to the AM and the AM asks for containers in the same heartbeat, the add check forwards the correct number of containers to the capacity scheduler.

          After the vacation I will rebase my patch and push it.
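          A rough sketch of that bookkeeping (the class below is hypothetical and only illustrates the idea; the real
          change would live in AppSchedulingInfo):

            import java.util.ArrayList;
            import java.util.HashMap;
            import java.util.List;
            import java.util.Map;

            // Remember which containers were handed to the AM while processing the current
            // heartbeat, and deduct them from the absolute ask the AM sends in that same heartbeat.
            public class PendingAskTracker {
                private final Map<String, Integer> forwardedAsks = new HashMap<>();
                private final List<String> allocatedThisHeartbeat = new ArrayList<>();

                // Called when the scheduler assigns a container on a node.
                public void recordAllocation(String host) {
                    allocatedThisHeartbeat.add(host);
                }

                // Called when the AM's absolute ask for a host arrives in the same heartbeat.
                public void updateAsk(String host, int absoluteAsk) {
                    long alreadyGiven = allocatedThisHeartbeat.stream().filter(h -> h.equals(host)).count();
                    // Forward only the part of the ask the AM has not already been given.
                    forwardedAsks.put(host, Math.max(0, absoluteAsk - (int) alreadyGiven));
                }

                public void heartbeatDone() {
                    allocatedThisHeartbeat.clear();
                }

                public static void main(String[] args) {
                    PendingAskTracker t = new PendingAskTracker();
                    for (int i = 0; i < 4; i++) t.recordAllocation("H1");  // 4 containers granted
                    t.updateAsk("H1", 7);                                  // AM's stale absolute ask
                    System.out.println(t.forwardedAsks.get("H1"));         // 3, not 7
                    t.heartbeatDone();
                }
            }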

          Karthik Kambatla added a comment -

          Got it. See the value in fixing it. Proposals on how to?

          Arun Suresh added a comment -

          Correct me if I am wrong, Giovanni Matteo Fumarola.
          Assume the situation mentioned in the Description.
          Currently, as per MAPREDUCE-4671, once an AM has received all its containers, the MR AM will just send a 0-container-count ask to cancel any outstanding requests. That implies that the RM has to wait for the next allocate call from the AM to free up any erroneously granted resources (7 containers in the example), which means that any other AM running on a fully utilized cluster will have to wait for an allocate heartbeat, or two, before its resource ask can be satisfied.
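          A toy timeline of that work-around (illustrative only, not RM code): the stale ask is cancelled only on the
          AM's next heartbeat, so the capacity stays spoken for one extra heartbeat interval.

            // Hypothetical walk-through of the 0-count cancellation described above.
            public class ZeroAskCancellation {
                public static void main(String[] args) {
                    int staleAskOnH1 = 7;    // AM-1 already has what it needs; this ask is stale
                    int freeForOtherAms = 0; // capacity the RM could hand to a competing AM

                    // Heartbeat t0: a competing AM asks for containers, but the RM still
                    // counts the stale 7 against AM-1, so nothing is freed up yet.
                    System.out.println("t0: stale ask = " + staleAskOnH1 + ", free for others = " + freeForOtherAms);

                    // Heartbeat t1: AM-1's next allocate call carries a 0-count ask for H1,
                    // cancelling the outstanding request and releasing the capacity.
                    staleAskOnH1 = 0;
                    freeForOtherAms = 7;
                    System.out.println("t1: stale ask = " + staleAskOnH1 + ", free for others = " + freeForOtherAms);
                }
            }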

          Karthik Kambatla added a comment -

          >> but it looks like this increases latencies for competing AMs

          Not sure I fully understand this. Care to elaborate with a concrete example?

          Arun Suresh added a comment -

          Karthik Kambatla, Vinod Kumar Vavilapalli, I understand from MAPREDUCE-4671 that the accounting burden for this has been pushed to the AM and it will not pose a latency issue for the AM requesting the resources, but it looks like this increases latencies for competing AMs (they might have to wait for a subsequent allocate call for the resources). Also, custom AMs would need to be cognizant of this.

          It also looks like Giovanni Matteo Fumarola is hitting this on some of the clusters he is working on. If Arun C Murthy is not actively looking into this, Giovanni would like to volunteer a patch.

          Thoughts?

          Giovanni Matteo Fumarola added a comment -

          Arun C Murthy, Vinod Kumar Vavilapalli, any updates on this?
          If you don't mind, can I work on this?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12546372/YARN-110.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          -1 javac. The patch appears to cause the build to fail.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/48//console

          This message is automatically generated.

          Arun C Murthy added a comment -

          Simple patch to resolve the difference between the AM's view of the world and the RM's within the 'transaction', i.e. the allocate call.

          The essential idea is for the RM to account for the containers newly allocated since the last AM heartbeat while updating the #containers for * (ANY).
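          A rough sketch of that idea (hypothetical names, not the actual patch): deduct whatever was allocated since
          the previous heartbeat before storing the AM's new absolute count against * (ANY).

            // Illustrative reconciliation of the * (ANY) count on each AM heartbeat.
            public class AnyCountReconciler {
                private int anyAsk = 0;                        // current #containers against * (ANY)
                private int allocatedSinceLastHeartbeat = 0;

                // Called whenever the scheduler hands a container to this app.
                public void containerAllocated() {
                    allocatedSinceLastHeartbeat++;
                }

                // Called on the AM heartbeat with the absolute count the AM still believes it needs.
                public void updateAnyAsk(int absoluteAsk) {
                    anyAsk = Math.max(0, absoluteAsk - allocatedSinceLastHeartbeat);
                    allocatedSinceLastHeartbeat = 0;
                }

                public static void main(String[] args) {
                    AnyCountReconciler r = new AnyCountReconciler();
                    for (int i = 0; i < 4; i++) r.containerAllocated();  // 4 granted before the heartbeat
                    r.updateAnyAsk(7);                                   // AM's stale absolute ask of 7
                    System.out.println(r.anyAsk);                        // 3, only the real remainder
                }
            }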

          Amol Kekre added a comment -

          Any updates on this?

          Vinod Kumar Vavilapalli added a comment -

          Reopening the issue as the discussion is still happening.

          Sharad Agarwal added a comment -

          For clarity, here is the full example:

          • AM asks for 4 containers on H1.
          • Asynchronously, host H1 heartbeats to the RM and gets assigned 4 containers. At this point the RM sets the value
            against H1 to zero in its aggregate request-table for all apps.
          • In the meanwhile the AM finds it needs 3 more containers on H2, for a total of 7 (4 on H1, 3 on H2).
          • The request table in the RM gets overwritten with (4 on H1, 3 on H2), a total of 7.
          • The RM may do locality-aware allocations on H1 or H2, but in reality it should only try to assign on H2.
          • If the RM gives the AM containers on H1 first, the AM will do off-host assignments and release the ones on H2
            (see the sketch after this list).
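          A small illustration of that locality problem (toy code, not the scheduler): the overwritten absolute table
          still lists H1, so a scheduler walking the table can pick H1 even though the AM's real remaining need is only
          on H2.

            import java.util.LinkedHashMap;
            import java.util.Map;

            public class LocalityDrift {
                public static void main(String[] args) {
                    // What the RM sees after the AM's absolute update (4 on H1, 3 on H2).
                    Map<String, Integer> rmTable = new LinkedHashMap<>();
                    rmTable.put("H1", 4);
                    rmTable.put("H2", 3);

                    // What the AM actually still needs: the 4 on H1 were already granted.
                    Map<String, Integer> trueNeed = Map.of("H2", 3);

                    // A scheduler that serves the first host with a pending ask picks H1,
                    // forcing the AM into off-host assignments and wasted releases.
                    String chosen = rmTable.entrySet().stream()
                            .filter(e -> e.getValue() > 0)
                            .map(Map.Entry::getKey)
                            .findFirst().orElse("none");
                    System.out.println("RM schedules on " + chosen + ", but the AM only needs " + trueNeed);
                }
            }
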
          Sharad Agarwal added a comment -

          >> Yes, this could lead to some waste, but the system is eventually consistent.

          There will be a lot of waste, particularly for applications which ramp up their requests over time rather than putting all the requests up front.
          Another issue will be sub-optimal scheduling in the AM.
          In the above example:

          • AM asks for 3 additional containers (total 7) on a different host, H2.
          • The request table in the RM will get overwritten with 4 on H1 and 3 on H2.
          • The RM may allocate containers on H1 or H2, but in reality it should only try to assign on H2.
          • If the RM gives the AM containers on H1 first, the AM will do off-host assignments and release the ones on H2.
          Arun C Murthy added a comment -

          I thought more about this... I spent time on this in the beginning, and I don't think a delta protocol is appropriate, particularly since the RM and AM are never 'in sync'.

          Yes, this could lead to some waste, but the system is eventually consistent.


            People

            • Assignee: Arun C Murthy
            • Reporter: Arun C Murthy
            • Votes: 0
            • Watchers: 22
