  1. REEF (Retired)
  2. REEF-1895

REEF Bridge performance improvement for allocated evaluators

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17
    • Component/s: REEF, REEF Bridge
    • Labels: None

    Description

      Recent scale tests show a few places in the REEF code, mainly in the bridge code, that seriously impact REEF performance and scalability. Notably:

      - synchronized(this) in the BridgeDriver event handlers, especially the allocated evaluator handlers, forces events to be handled one at a time. When a few thousand evaluators are requested, the slowdown is dramatic.
      - A lock on Evaluators, taken when an allocated evaluator is received in the bridge, increases the execution time to the order of minutes. The matching logic in this code is not used at all.
      - Some values could be reused but are recomputed for each evaluator, especially in cross-bridge calls. Once the number of evaluators reaches a few thousand, the time spent becomes significant.
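
      The first and third points can be illustrated with a minimal Java sketch. It is not REEF's actual BridgeDriver code: the class AllocatedEvaluatorSketch, its methods, and the "launch-config" placeholder are all hypothetical. It only shows the general idea of replacing a lock on the whole handler object with a thread-safe map, and of computing an evaluator-independent value once rather than per event.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for allocated-evaluator handling; not REEF's actual code. */
public class AllocatedEvaluatorSketch {
  // Counts how often the shared value is computed (expected: once per driver).
  private final AtomicInteger configComputations = new AtomicInteger();
  // Computed once in the constructor and reused by every handler call.
  private final String sharedLaunchConfig;
  // Thread-safe map, so handlers do not need a lock on the whole driver object.
  private final ConcurrentHashMap<String, String> launched = new ConcurrentHashMap<>();

  public AllocatedEvaluatorSketch() {
    this.sharedLaunchConfig = computeLaunchConfig();
  }

  private String computeLaunchConfig() {
    configComputations.incrementAndGet();
    return "launch-config"; // placeholder for an expensive, evaluator-independent value
  }

  /** No synchronized(this): handlers for different evaluators may run concurrently. */
  public void onAllocated(String evaluatorId) {
    launched.put(evaluatorId, sharedLaunchConfig);
  }

  public int launchedCount() {
    return launched.size();
  }

  public int configComputations() {
    return configComputations.get();
  }

  /** Feeds n simulated allocation events through a thread pool and returns the driver. */
  public static AllocatedEvaluatorSketch simulate(int n, int threads) {
    AllocatedEvaluatorSketch driver = new AllocatedEvaluatorSketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int i = 0; i < n; i++) {
      final String id = "evaluator-" + i;
      pool.execute(() -> driver.onAllocated(id));
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("handlers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for handlers", e);
    }
    return driver;
  }

  public static void main(String[] args) {
    AllocatedEvaluatorSketch driver = simulate(2000, 8);
    System.out.println(driver.launchedCount() + " evaluators handled, config computed "
        + driver.configComputations() + " time(s)");
  }
}
```

      Whether a concurrent map is enough depends on what state the real handlers share. If a handler must update several structures atomically, a narrower lock around only that update may still be needed.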

      After an evaluator is allocated, YARN reports a failed evaluator if it does not receive the launch command within the timeout. With the current code, we cannot even launch two thousand containers from the .NET side before the timeout.
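
      A rough way to reason about this limit is the arithmetic below, written as a small Java sketch. The class LaunchBudgetSketch and the per-event cost, parallelism, and timeout values in main are hypothetical and are not measurements from the scale tests. The sketch only shows that, with fully serialized handling, total time grows linearly with the number of evaluators.

```java
/** Hypothetical back-of-the-envelope model; the numbers are illustrative only. */
public class LaunchBudgetSketch {

  /** Time to handle all events when at most `parallelism` handlers run at once. */
  public static long totalMillis(int evaluators, long perEventMillis, int parallelism) {
    long rounds = (evaluators + parallelism - 1L) / parallelism; // ceiling division
    return rounds * perEventMillis;
  }

  /** True if every launch command can be sent before the timeout elapses. */
  public static boolean fitsTimeout(int evaluators, long perEventMillis,
                                    int parallelism, long timeoutMillis) {
    return totalMillis(evaluators, perEventMillis, parallelism) <= timeoutMillis;
  }

  public static void main(String[] args) {
    // Assumed values: 2000 evaluators, 100 ms per event, 60 s timeout.
    System.out.println("serialized fits: " + fitsTimeout(2000, 100, 1, 60_000));
    System.out.println("8 handlers fit:  " + fitsTimeout(2000, 100, 8, 60_000));
  }
}
```

      Under these assumed values, serialized handling needs 200 seconds, while eight concurrent handlers need 25 seconds.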

      This JIRA improves the handling of allocated evaluators in order to increase scalability.

            People

              Assignee: Julia Wang (juliaw)
              Reporter: Julia Wang (juliaw)
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved: