Wangda Tan, thank you for your comments.
BTW - your document is a great set of requirements; it was a real pleasure reading it.
Please see my answers.
1. Label Expression
>>>>> Label expression - logical combination of labels (using && and, || or, ! not)
It seems to me the label expression is too complex here; the expression will be verified when we make a scheduling decision to allocate every container. We need to consider performance.
I definitely agree that performance needs to be considered here.
An application-level label expression does not change during the application's lifetime - so it can be cached.
A queue-level label expression changes only when the queue label is changed - so it can be cached per queue.
So the final expression that combines the Queue Label, the Application Label Expression and the QueueLabelPolicy does not need to be evaluated for every ResourceRequest - again, unless the AppMaster dynamically assigns different labels per container request.
What probably does need to be evaluated is which nodes satisfy the final/effective LabelExpression, as nodes can come and go and the labels on them can change.
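To make the caching idea concrete, here is a minimal sketch, not a proposed implementation. It assumes plain AND semantics and represents an expression as a set of labels; `EffectiveExpressionCache`, `and_combine` and `nodes_matching` are hypothetical names that exist neither in the proposal nor in YARN.

```python
class EffectiveExpressionCache:
    """Caches the combined queue + application expression per (queue, app)."""

    def __init__(self, combine):
        self._combine = combine   # hypothetical function applying the queue policy
        self._cache = {}          # (queue, app) -> combined expression
        self.evaluations = 0      # number of real (uncached) evaluations

    def get(self, queue, queue_labels, app, app_labels):
        key = (queue, app)
        if key not in self._cache:
            self.evaluations += 1
            self._cache[key] = self._combine(queue_labels, app_labels)
        return self._cache[key]

    def invalidate_queue(self, queue):
        """Drop cached entries for a queue whose label has changed."""
        for key in [k for k in self._cache if k[0] == queue]:
            del self._cache[key]


def and_combine(queue_labels, app_labels):
    # Assumption: AND policy, so a node must carry every label from both sides.
    return frozenset(queue_labels) | frozenset(app_labels)


def nodes_matching(required, node_labels):
    # Re-run on every scheduling pass, because nodes and their labels can change.
    return {node for node, labels in node_labels.items() if required <= labels}
```

Only `nodes_matching` has to run again when a node joins, leaves or is relabelled; the combined expression is recomputed only after `invalidate_queue`.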
>>>> Another problem with this is that it will make it harder to calculate the headroom of an application or the capacity of a queue.
Thank you for pointing this out. I will double check on this.
>>>> And it is not so straightforward for a user/admin to find out how many nodes can satisfy a given label expression.
I am sure we can provide an admin API/REST/UI to enter an expression and get the result.
>>>> IMHO, we can simply make node labels AND'ed, most scenarios will be covered. It will be easier to eval and user can better understand as well.
Let me understand it better: if an application provides multiple labels, they are "AND"ed, so only nodes that have the same set of labels or a superset of it will be used?
2. Queue Policy
>>>> There're 4 policies mentioned in your proposal. We should reduce the complexity of configuration as much as possible.
>>>> At least, "OR" is not so meaningful to me here, do you have any usecase/example on this one?
Consider this as the union of the LabelExpressions from the Application and the Queue. So if you have a LabelExpression of "blue" and a QueueExpression of "yellow",
you can allocate containers on nodes that have either label "blue" or label "yellow" (nodes that are marked with neither won't be used). This is unlike the "AND" case, where you can only run on nodes that are marked as both "blue" and "yellow" (a subset).
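A small sketch of the difference, using the "blue"/"yellow" example above. It covers only the AND and OR policies; the function name `eligible_nodes` and the node names are made up for illustration.

```python
def eligible_nodes(app_label, queue_label, policy, node_labels):
    """Return the sorted names of nodes allowed under the given queue policy."""
    if policy == "AND":
        # Node must carry both the application label and the queue label.
        wanted = lambda labels: app_label in labels and queue_label in labels
    elif policy == "OR":
        # Node must carry at least one of the two labels.
        wanted = lambda labels: app_label in labels or queue_label in labels
    else:
        raise ValueError("unsupported policy in this sketch: " + policy)
    return sorted(n for n, labels in node_labels.items() if wanted(labels))


# Hypothetical cluster: n4 carries no labels at all.
nodes = {
    "n1": {"blue"},
    "n2": {"yellow"},
    "n3": {"blue", "yellow"},
    "n4": set(),
}
```

With these nodes, "OR" allows n1, n2 and n3, while "AND" allows only n3; the unlabelled n4 is excluded in both cases.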
I think AND should be enough to cover most usecases.
3. Labels Manager
>>>> 3.1 What's the process of modifying the node label configuration? Since the file is stored on DFS, does the admin modify the configuration in a local file, then upload it to DFS via "hadoop fs -copyFromLocal ..."? If yes, it will be hard for the admin to configure.
Yes - so far this is the procedure. I am not sure what is "hard" here, but we can have some API to do it.
We suggest centralized location for node labels such as file stored on DFS that all the YARN daemons
>>>> What's the prospect of making it available to all YARN daemons? I think making it available to the RM should be enough here.
Agree - today this file may be relevant only to the RM. But if it is stored as a local file or by other means, there is a greater chance of it being overwritten or lost in the upgrade process.
4. Specify labels in container level
I found you plan to add a labels field in ResourceRequest, which was also mentioned by Bc Wong. I think we should support the container level; the user doesn't have to use it, and it will only be needed when specifying labels at the app level is not enough.
>>>> Yes - if the application level is not enough, the user can specify labels at the request level; otherwise it is not necessary. Though I cannot say we looked very closely at the possibility of setting labels at a more granular level (to address your next comment).
And if we support this, it will not be sufficient to change only isBlackListed in AppSchedulingInfo in the scheduler to make the fair/capacity schedulers work. We may need to modify the implementations of the different schedulers.
5. Label specification for hierarchy queues
We can support specifying labels only on leaf queues: in the existing scheduler configuration, settings like user-limit can only be specified on leaf queues, so we can make them consistent. The "closest will be used" strategy will potentially cause some configuration issues as well.
>>>> Sure, we can make them consistent. Our thought process was that if you have multiple leaf queues that should share the same label/policy, you can specify it at the parent level, so you don't need to "type" more than necessary.
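To illustrate what "the closest will be used" would mean in practice, here is a minimal sketch. It assumes dot-separated queue paths and a hypothetical `configured` mapping from queue path to label; neither the function name nor the queue names come from the proposal.

```python
def resolve_queue_label(queue_path, configured):
    """Return the label set on the queue itself or on its closest ancestor."""
    parts = queue_path.split(".")
    while parts:
        candidate = ".".join(parts)
        if candidate in configured:
            return configured[candidate]
        parts.pop()  # walk one level up the hierarchy
    return None      # no label configured anywhere on the path


# Hypothetical configuration: one label on a parent, one override on a leaf.
configured = {
    "root.prod": "blue",
    "root.prod.etl": "yellow",
}
```

Here `root.prod.etl` resolves to its own label, any other leaf under `root.prod` inherits "blue", and queues outside `root.prod` get no label.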
6. In "Considerations" part
If we assume that during life of the application none of those changes can take effect on the application
>>>> I think we can assume application will not change label expression during its lifecycle. But updating labels of node/queue should affect future scheduling considerations.
Yes, the application label expression is definitely not going to change over the application's run time. I was referring to re-evaluating which nodes match the "final" label expression, for performance reasons.
And even if we assume queue/node labels do not change for an application, we still need to consider nodes being added/removed dynamically in the cluster.
>>>> When invalid label expression (consists of label(s) that are not present in the labels file) is used to define for Queue or Application it will be ignored as if no label was set. RM logs will have errors about usage of invalid labels
>>>> I think we should tell user this resource request is invalid, we cannot hide this error in RM logs. Because not every user can access logs of YARN daemons.
Completely agree that this needs to be propagated to the end user in some shape or form. Would love to hear your proposal in this area.
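Purely as a starting point for that discussion, one possible shape is to validate the labels at submission time and return the error to the caller instead of only logging it. The names `InvalidLabelError` and `validate_labels` below are hypothetical and not part of any existing API.

```python
class InvalidLabelError(ValueError):
    """Raised back to the submitting user, not just written to the RM log."""


def validate_labels(requested, known):
    """Reject a request that uses labels missing from the labels file."""
    unknown = sorted(set(requested) - set(known))
    if unknown:
        raise InvalidLabelError("Unknown node label(s): " + ", ".join(unknown))
    return set(requested)
```

The message lists every unknown label, so the user can fix the request without access to the logs of the YARN daemons.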
If no node that satisfies final label evaluation is available Application will be waiting to be submitted.
>>>> In our proposal, AMS will reject the request if no node satisfies the node label of a ResourceRequest, because the user may have mis-filled the node label in the ResourceRequest.
>>>> We may need to discuss which one will be better.
Absolutely - let's discuss it.