Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-4879

Enhance Allocate Protocol to Identify Requests Explicitly

    Details

    • Hadoop Flags:
      Reviewed

      Description

      For legacy reasons, the current allocate protocol expects expanded requests which represent the cumulative request for any change in resource constraints. This is not only very difficult to comprehend but makes it impossible for the scheduler to associate container allocations to the original requests. This problem is amplified by the fact that the expansion is managed by the AMRMClient which makes it cumbersome for non-Java clients as they all have to replicate the non-trivial logic. In this JIRA, we are proposing enhancement to the Allocate Protocol to allow AMs to identify requests explicitly.

      1. SimpleAllocateProtocolProposal-v1.pdf
        182 kB
        Subru Krishnan
      2. SimpleAllocateProtocolProposal-v2.pdf
        189 kB
        Subru Krishnan

        Issue Links

          Activity

          Hide
          subru Subru Krishnan added a comment -

          Attaching a design doc that details the simple delta allocate protocol. Thanks to Arun Suresh, Vinod Kumar Vavilapalli, Karthik Kambatla,Wangda Tan, Jian He,Carlo Curino,Kishore Chaliparambil and others for all the thoughts and feedback during the many offline discussions.

          Show
          subru Subru Krishnan added a comment - Attaching a design doc that details the simple delta allocate protocol. Thanks to Arun Suresh , Vinod Kumar Vavilapalli , Karthik Kambatla , Wangda Tan , Jian He , Carlo Curino , Kishore Chaliparambil and others for all the thoughts and feedback during the many offline discussions.
          Hide
          giovanni.fumarola Giovanni Matteo Fumarola added a comment -

          Thanks Arun Suresh and Subru Krishnan to start this. Everything looks good and easy to read and understand.

          Few Comments:

          The allocated Container(s) received as part of the response will have the ID corresponding to the original ResourceRequest for which the RM made the allocation.

          Are you going to add the ResourceRequestId as field of Container?

          You guys wrote

          At high level, we propose to add an ID field to ResourceRequest

          after few lines you wrote

          The resource-request data structure in RM’s AppSchedulingInfo will be updated to Map<Priority, Map<String, Map<Resource, Map<int, ResourceRequest>>>>.


          You will not need a int field in the map, because it is already part of the ResourceRequest. Do you want to keep the map ordered by ID?

          Show
          giovanni.fumarola Giovanni Matteo Fumarola added a comment - Thanks Arun Suresh and Subru Krishnan to start this. Everything looks good and easy to read and understand. Few Comments: The allocated Container(s) received as part of the response will have the ID corresponding to the original ResourceRequest for which the RM made the allocation. Are you going to add the ResourceRequestId as field of Container? You guys wrote At high level, we propose to add an ID field to ResourceRequest after few lines you wrote The resource-request data structure in RM’s AppSchedulingInfo will be updated to Map<Priority, Map<String, Map<Resource, Map<int, ResourceRequest>>>>. You will not need a int field in the map, because it is already part of the ResourceRequest. Do you want to keep the map ordered by ID?
          Hide
          subru Subru Krishnan added a comment -

          Everything looks good and easy to read and understand.

          Thanks Giovanni Matteo Fumarola for looking at our proposal and providing feedback.

          Are you going to add the ResourceRequestId as field of Container?

          Yes. That is how AMs can map the allocated container to their actual request.

          You will not need a int field in the map, because it is already part of the ResourceRequest. Do you want to keep the map ordered by ID?

          Good catch. This is the exact question that Arun Suresh raised. A list is sufficient but I am leaning towards Map for making the accounting process more simple/fast.
          For e.g.: If a node local allocation is made for node N1, we can immediately lookup the entries for rack and ANY by using the ID key and decrement them instead of linearly scanning the rack/ANY entries.
          Yes, we want to use a sorted map. The reason for doing that is we want to start with a FIFO allocation order.

          Show
          subru Subru Krishnan added a comment - Everything looks good and easy to read and understand. Thanks Giovanni Matteo Fumarola for looking at our proposal and providing feedback. Are you going to add the ResourceRequestId as field of Container? Yes. That is how AMs can map the allocated container to their actual request. You will not need a int field in the map, because it is already part of the ResourceRequest. Do you want to keep the map ordered by ID? Good catch. This is the exact question that Arun Suresh raised. A list is sufficient but I am leaning towards Map for making the accounting process more simple/fast. For e.g.: If a node local allocation is made for node N1, we can immediately lookup the entries for rack and ANY by using the ID key and decrement them instead of linearly scanning the rack/ANY entries. Yes, we want to use a sorted map. The reason for doing that is we want to start with a FIFO allocation order.
          Hide
          kasha Karthik Kambatla added a comment -

          Thanks for putting the doc together with all the details. Would be nice to capture details on when we would delete the ResourceRequests altogether from AppSchedulingInfo.

          While making these changes, would it possible to address YARN-314 too?

          Show
          kasha Karthik Kambatla added a comment - Thanks for putting the doc together with all the details. Would be nice to capture details on when we would delete the ResourceRequests altogether from AppSchedulingInfo. While making these changes, would it possible to address YARN-314 too?
          Hide
          leftnoteasy Wangda Tan added a comment -

          Overall looks good, thanks Subru Krishnan!

          This will be very useful to solve existing request tracking issues.

          Show
          leftnoteasy Wangda Tan added a comment - Overall looks good, thanks Subru Krishnan ! This will be very useful to solve existing request tracking issues.
          Hide
          leftnoteasy Wangda Tan added a comment -

          Linked this to YARN-4902, which will be a longer term fix of resource request related issues.

          Show
          leftnoteasy Wangda Tan added a comment - Linked this to YARN-4902 , which will be a longer term fix of resource request related issues.
          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Tx for the doc, Subru Krishnan and Arun Suresh! +1 overall for a unique identifier.

          Comments on your doc

          • I'd rather call it "an enhancement to identify requests explicitly" instead of "simple (delta) allocate protocol". We used to use the phrase "delta protocol" in a slightly different context - see YARN-110.
          • The RM will attempt to allocate containers in decreasing sequence number order,

            Why are we putting priority semantics onto the ID? We should just follow the existing priority ordering.

          • In our proposal, we could potentially have requests for each container at worst case.

            It is both network / memory overhead as well as scheduler's CPU time. Till we move off to global scheduling completely, we should be cautious about this. Of course, by inverting the ResourceRequest and still keying by ResourceName in the API, we are limiting the total entries to be of the order of the cluster-size.
            I already suggested on YARN-1547 that we also have an upper limit on the total number of requests - see here. But I strongly suggest that we have additional limits on the total number of IDs that can be used - this will fit our narrative at YARN-4902 too.

          Comments from YARN-4902

          Copy-edit-pasting here a few comments that we posted in the document for YARN-4902, and those I think were not laid out in the doc explicitly. We were calling it Allocation-ID there, I guess I now like Request-ID better. If some or all of them make sense, you can add them to your doc

          • Scope: This ID is a unique identifier for different ResourceRequests from the same application - essentially IDs can conflict across applications.
          • Generation: The application should simply generate a unique identifier within the application - if not the client-libraries can do so if desired by the application.
          • Non-binding nature: Applications can continue to completely ignore the returned Allocation-ID in the response and use the allocation for any of their outstanding requests
          • Responses: The scheduler may return multiple responses corresponding to the same Allocation-ID - as and when scheduler returns allocations
          • Deeper details on updates: Similar to the current API, update of only selected fields against a previously existing Allocation-ID will only update the object (as opposed to replacing it). For e.g, say a ResourceRequest first gets created with Allocation-ID "76589" and with "host: *". A future ResourceRequest with the same Allocation-ID but with contents “rack05: 10” will only append the rack information to the existing object. This is how one can replace parts of an object and is similar to how the existing per-record-deltas based protocol works.
          • Deletes: Similarly, if one wishes to replace an entire ResourceRequest corresponding to a specific allocation-ID, they can simply cancel the corresponding ResourceRequest and submit a new one afresh.

          Other responses

          If a node local allocation is made for node N1, we can immediately lookup the entries for rack and ANY by using the ID key and decrement them instead of linearly scanning the rack/ANY entries.

          +1, ID is really the logical grouping key.

          While making these changes, would it possible to address YARN-314 too?

          I'm okay if we can get two in a shot, but I'd caution against risking this effort by blowing up the size.

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - Tx for the doc, Subru Krishnan and Arun Suresh ! +1 overall for a unique identifier. Comments on your doc I'd rather call it "an enhancement to identify requests explicitly" instead of "simple (delta) allocate protocol". We used to use the phrase "delta protocol" in a slightly different context - see YARN-110 . The RM will attempt to allocate containers in decreasing sequence number order, Why are we putting priority semantics onto the ID? We should just follow the existing priority ordering. In our proposal, we could potentially have requests for each container at worst case. It is both network / memory overhead as well as scheduler's CPU time. Till we move off to global scheduling completely, we should be cautious about this. Of course, by inverting the ResourceRequest and still keying by ResourceName in the API, we are limiting the total entries to be of the order of the cluster-size. I already suggested on YARN-1547 that we also have an upper limit on the total number of requests - see here . But I strongly suggest that we have additional limits on the total number of IDs that can be used - this will fit our narrative at YARN-4902 too. Comments from YARN-4902 Copy-edit-pasting here a few comments that we posted in the document for YARN-4902 , and those I think were not laid out in the doc explicitly. We were calling it Allocation-ID there, I guess I now like Request-ID better. If some or all of them make sense, you can add them to your doc Scope : This ID is a unique identifier for different ResourceRequests from the same application - essentially IDs can conflict across applications. Generation : The application should simply generate a unique identifier within the application - if not the client-libraries can do so if desired by the application. Non-binding nature : Applications can continue to completely ignore the returned Allocation-ID in the response and use the allocation for any of their outstanding requests Responses : The scheduler may return multiple responses corresponding to the same Allocation-ID - as and when scheduler returns allocations Deeper details on updates : Similar to the current API, update of only selected fields against a previously existing Allocation-ID will only update the object (as opposed to replacing it). For e.g, say a ResourceRequest first gets created with Allocation-ID "76589" and with "host: *" . A future ResourceRequest with the same Allocation-ID but with contents “rack05: 10” will only append the rack information to the existing object. This is how one can replace parts of an object and is similar to how the existing per-record-deltas based protocol works. Deletes : Similarly, if one wishes to replace an entire ResourceRequest corresponding to a specific allocation-ID, they can simply cancel the corresponding ResourceRequest and submit a new one afresh. Other responses If a node local allocation is made for node N1, we can immediately lookup the entries for rack and ANY by using the ID key and decrement them instead of linearly scanning the rack/ANY entries. +1, ID is really the logical grouping key. While making these changes, would it possible to address YARN-314 too? I'm okay if we can get two in a shot, but I'd caution against risking this effort by blowing up the size.
          Hide
          vinodkv Vinod Kumar Vavilapalli added a comment -

          BTW, for the federation related issue, does the client-library need to always generate these IDs? How does that interact with application generated IDs?

          Show
          vinodkv Vinod Kumar Vavilapalli added a comment - BTW, for the federation related issue, does the client-library need to always generate these IDs? How does that interact with application generated IDs?
          Hide
          stevel@apache.org Steve Loughran added a comment -

          I like the idea of having requestIDs on requests; helps us map from request to response. As you say, today things are an ugly hack with priority.

          I would also like to see if the allocated containers could support a role ID field too...nothing much, but enough that on an AM restart their role can be determined. That one, I'd keep separate from the request ID; they serve slightly different purposes. (I could have 5 requests outstanding for containers of role 4; I'd want to track those requests)

          Show
          stevel@apache.org Steve Loughran added a comment - I like the idea of having requestIDs on requests; helps us map from request to response. As you say, today things are an ugly hack with priority. I would also like to see if the allocated containers could support a role ID field too...nothing much, but enough that on an AM restart their role can be determined. That one, I'd keep separate from the request ID; they serve slightly different purposes. (I could have 5 requests outstanding for containers of role 4; I'd want to track those requests)
          Hide
          subru Subru Krishnan added a comment -

          Thanks Karthik Kambatla, Wangda Tan, Vinod Kumar Vavilapalli and Steve Loughran for taking a look at our proposal. PFA updated doc (v2) that incorporates your feedback.

          A few additional clarifications (have addressed rest of comments directly in the updated doc):

          While making these changes, would it possible to address YARN-314 too?

          I'm okay if we can get two in a shot, but I'd caution against risking this effort by blowing up the size.

          We will address YARN-314 as long as applications specify the Request-ID as they can request for multiple containers at same priority through independent requests.

          Why are we putting priority semantics onto the ID? We should just follow the existing priority ordering.

          We will continue to follow the existing priority ordering. But as explained above, with the proposed enhancement user can potentially make multiple requests at same priority (YARN-314). In such a scenario, we will simply allocate containers in FIFO order.

          BTW, for the federation related issue, does the client-library need to always generate these IDs? How does that interact with application generated IDs?

          In Federation also, we expect applications to generate the IDs. For e.g.: we will work with the REEF team (and the long running service AM proposed as part of YARN-4692) to start specifying IDs for their allocation requests.

          I would also like to see if the allocated containers could support a role ID field too...nothing much, but enough that on an AM restart their role can be determined. That one, I'd keep separate from the request ID; they serve slightly different purposes. (I could have 5 requests outstanding for containers of role 4; I'd want to track those requests)

          I agree that having an explicit role ID is useful but feel its outside the scope of this JIRA which IIUC is what you are also observing. I think adding a role ID should be part of YARN-4692/YARN-4902.

          Show
          subru Subru Krishnan added a comment - Thanks Karthik Kambatla , Wangda Tan , Vinod Kumar Vavilapalli and Steve Loughran for taking a look at our proposal. PFA updated doc (v2) that incorporates your feedback. A few additional clarifications (have addressed rest of comments directly in the updated doc): While making these changes, would it possible to address YARN-314 too? I'm okay if we can get two in a shot, but I'd caution against risking this effort by blowing up the size. We will address YARN-314 as long as applications specify the Request-ID as they can request for multiple containers at same priority through independent requests. Why are we putting priority semantics onto the ID? We should just follow the existing priority ordering. We will continue to follow the existing priority ordering. But as explained above, with the proposed enhancement user can potentially make multiple requests at same priority ( YARN-314 ). In such a scenario, we will simply allocate containers in FIFO order. BTW, for the federation related issue, does the client-library need to always generate these IDs? How does that interact with application generated IDs? In Federation also, we expect applications to generate the IDs. For e.g.: we will work with the REEF team (and the long running service AM proposed as part of YARN-4692 ) to start specifying IDs for their allocation requests. I would also like to see if the allocated containers could support a role ID field too...nothing much, but enough that on an AM restart their role can be determined. That one, I'd keep separate from the request ID; they serve slightly different purposes. (I could have 5 requests outstanding for containers of role 4; I'd want to track those requests) I agree that having an explicit role ID is useful but feel its outside the scope of this JIRA which IIUC is what you are also observing. I think adding a role ID should be part of YARN-4692 / YARN-4902 .
          Hide
          subru Subru Krishnan added a comment - - edited

          There was a discussion regarding the implementation details with Wangda Tan, Vinod Kumar Vavilapalli, Karthik Kambatla and Arun Suresh. The crux of the discussion was:

          • The request ID can be used for either identifying a single RR or a group of RRs. The latter is required if we want to encode the notion of relaxLocality - i.e. allocate a container in node N1, if not fallback to rack R1 and finally ANY. I will fix the RR API Javdocs in YARN-4887 accordingly.
          • The existing AMRMClientImpl is convoluted and has some vestigial behaviour that results in auto expansion of client RRs due to rack inference. Quite a bit of the complexity in AMRMClientImpl results from the matching logic which can be greatly simplified by using the request ID. So the consensus was to leave the existing AMRMClientImpl as-is for backward compatibility and start with a fresh version of AMRMClientImpl (v2) that will have a clean implementation using the request ID. I created YARN-5225 to track the first version of AMRMClientImpl (v2)
          Show
          subru Subru Krishnan added a comment - - edited There was a discussion regarding the implementation details with Wangda Tan , Vinod Kumar Vavilapalli , Karthik Kambatla and Arun Suresh . The crux of the discussion was: The request ID can be used for either identifying a single RR or a group of RRs. The latter is required if we want to encode the notion of relaxLocality - i.e. allocate a container in node N1, if not fallback to rack R1 and finally ANY. I will fix the RR API Javdocs in YARN-4887 accordingly. The existing AMRMClientImpl is convoluted and has some vestigial behaviour that results in auto expansion of client RRs due to rack inference. Quite a bit of the complexity in AMRMClientImpl results from the matching logic which can be greatly simplified by using the request ID. So the consensus was to leave the existing AMRMClientImpl as-is for backward compatibility and start with a fresh version of AMRMClientImpl (v2) that will have a clean implementation using the request ID. I created YARN-5225 to track the first version of AMRMClientImpl (v2)

            People

            • Assignee:
              subru Subru Krishnan
              Reporter:
              subru Subru Krishnan
            • Votes:
              2 Vote for this issue
              Watchers:
              29 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development