  1. REEF (Retired)
  2. REEF-1870

Kill slower Evaluators in IMRU after timeout in data loading

Details

    • Improvement
    • Status: In Progress
    • Major
    • Resolution: Unresolved
    • 0.17
    • None
    • IMRU, REEF

    Description

      The job was submitted with a total of 4 retries. In each retry, most evaluators finish data downloading/deserialization within 6-30 minutes, but about 3 evaluators are very slow. The slowest one took about 2-8 hours to download and deserialize its data in each retry. Each retry was triggered after a 30-minute timeout (configurable). The driver cannot send a close event to the slower evaluators until they complete data loading and send the IRunningTask event to the driver. After running for a long time, the job was killed.

      A simple band-aid is to kill the evaluators from which we have not received a RunningTask after the 30-minute timeout, and to cancel the RunningTasks that have already been received. It is needless to wait 8 hours only to cancel RunningTasks that have just finished downloading/deserializing their data.
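
      The proposed behaviour can be illustrated with a small standalone sketch. It is not REEF or IMRU code: `LoadingTracker`, its methods, and the evaluator ids are hypothetical names chosen for illustration, and the real driver would act on its own evaluator and task handles. The sketch only shows the decision made when the timeout expires: tasks that have already reported as running are cancelled, and evaluators that are still loading data are killed instead of being waited on.

```python
from dataclasses import dataclass, field


@dataclass
class LoadingTracker:
    """Hypothetical helper: tracks which evaluators have reported a running task."""
    timeout_seconds: float
    start_time: float
    pending: set = field(default_factory=set)  # evaluators still loading data
    running: set = field(default_factory=set)  # evaluators whose task is running

    def on_evaluator_submitted(self, evaluator_id):
        # A task was submitted to this evaluator; data loading has started.
        self.pending.add(evaluator_id)

    def on_running_task(self, evaluator_id):
        # The evaluator finished loading and reported a running task.
        self.pending.discard(evaluator_id)
        self.running.add(evaluator_id)

    def timed_out(self, now):
        # The timeout only matters while some evaluator is still loading.
        return bool(self.pending) and now - self.start_time >= self.timeout_seconds

    def plan_recovery(self, now):
        """Return (tasks_to_close, evaluators_to_kill); both empty before the timeout."""
        if not self.timed_out(now):
            return [], []
        return sorted(self.running), sorted(self.pending)


# Four evaluators start loading; two report a running task before the timeout.
tracker = LoadingTracker(timeout_seconds=30 * 60, start_time=0.0)
for evaluator_id in ["e1", "e2", "e3", "e4"]:
    tracker.on_evaluator_submitted(evaluator_id)
tracker.on_running_task("e1")
tracker.on_running_task("e2")

to_close, to_kill = tracker.plan_recovery(now=31 * 60)
print(to_close, to_kill)  # ['e1', 'e2'] ['e3', 'e4']
```

      With this split, the retry can begin as soon as the timeout fires, because the driver no longer depends on an event from the slow evaluators before it can release them.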

          People

            juliaw Julia Wang
