Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Recent scale tests show that a few places in the REEF code, mainly in the bridge code, seriously hurt REEF performance and scalability. Notably:
-synchronized(this) in the BridgeDriver event handlers, especially the allocated evaluator handlers. This forces the events to be handled one at a time. When a few thousand evaluators are requested, the slowdown is dramatic.
-A lock on Evaluators taken when the bridge receives an allocated evaluator, which increases execution time on the order of minutes. The matching logic in this code is not used at all.
-Some values that could be reused are recomputed for each evaluator, especially in cross-bridge calls. Once the number of evaluators reaches a few thousand, the time spent becomes significant.
After an evaluator is allocated, if YARN does not receive a launch command within the timeout period, it reports a failed evaluator. With the current code, we cannot even launch two thousand containers from the .NET side before the timeout.
This JIRA is to improve the handling of allocated evaluators in order to increase scalability.
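As a rough illustration of the first and third points in the list above, here is a minimal Java sketch, not the actual REEF bridge code. The class `AllocatedEvaluatorSketch`, its methods, and the placeholder launch configuration are all hypothetical names invented for this example. It assumes the handler only needs to record the evaluator, so a `ConcurrentHashMap` can stand in for a handler-wide `synchronized(this)` block, and it assumes the per-evaluator value is identical for every evaluator, so it can be computed once and reused.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a bridge driver; this is NOT the real REEF BridgeDriver API.
final class AllocatedEvaluatorSketch {
  // Counts how often the "expensive" value is computed (for illustration only).
  static final AtomicInteger configComputations = new AtomicInteger();

  // A thread-safe map replaces a coarse lock around a plain collection.
  private final ConcurrentHashMap<String, String> evaluators = new ConcurrentHashMap<>();

  // Computed once per driver and reused by every handler invocation.
  private final String sharedLaunchConfig = computeLaunchConfig();

  // Placeholder for work that would otherwise be repeated per evaluator.
  private static String computeLaunchConfig() {
    configComputations.incrementAndGet();
    return "launch-config";
  }

  // No synchronized(this): handlers for different evaluators may run concurrently.
  void onAllocatedEvaluator(String evaluatorId) {
    evaluators.put(evaluatorId, sharedLaunchConfig);
  }

  int evaluatorCount() {
    return evaluators.size();
  }

  // Fires evaluatorCount allocation events from a pool of handler threads.
  static AllocatedEvaluatorSketch simulate(int evaluatorCount, int threads) {
    AllocatedEvaluatorSketch driver = new AllocatedEvaluatorSketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < evaluatorCount; i++) {
      final String id = "evaluator-" + i;
      pool.execute(() -> driver.onAllocatedEvaluator(id));
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("handlers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for handlers", e);
    }
    return driver;
  }

  public static void main(String[] args) {
    AllocatedEvaluatorSketch driver = simulate(2000, 8);
    System.out.println(driver.evaluatorCount() + " evaluators handled");
  }
}
```

Whether the real handlers can drop their locks this way depends on what state they share. The sketch only shows the general shape: keep shared state in concurrent collections and move evaluator-independent work out of the per-evaluator path.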
Issue Links
- relates to REEF-1894 Remove locks in event handlers in Java bridge JobDriver (Open)