Description
Currently, in the .NET Group Communication and IMRU scenario, if any Evaluator fails for whatever reason, the driver kills all the Evaluators.
There are multiple levels of fault tolerance. The scenario we would like to support in this JIRA is:
- When an Evaluator fails, the failed Evaluator is killed and the other good Evaluators stay, but all the tasks running on those Evaluators are stopped.
- A new Evaluator is requested and started with the original task.
- The same tasks are resubmitted to the rest of the Evaluators.
- The topology of those tasks is kept in the same group communication as before.
- The data that has been downloaded in those good Evaluators stays.
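The recovery steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: `Evaluator`, `RecoveryDriver`, `on_evaluator_failed`, and the `partition-for-...` names are hypothetical and are not part of the REEF.NET or IMRU API, and the sketch is written in Python rather than C# purely for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Evaluator:
    evaluator_id: str
    task_id: Optional[str] = None      # task assigned to this evaluator
    task_running: bool = False
    cached_partitions: set = field(default_factory=set)  # data already downloaded


class RecoveryDriver:
    """Hypothetical driver model; NOT the actual REEF.NET IMRU driver."""

    def __init__(self, task_ids, topology):
        self.topology = topology       # task_id -> parent task_id, fixed for the job
        self.evaluators = {}
        self.downloads = 0             # counts simulated data downloads
        self._next_id = 0
        for task_id in task_ids:
            self._start_evaluator(task_id)

    def _start_evaluator(self, task_id):
        # Request an evaluator, download its data once, and start its task.
        ev = Evaluator(f"evaluator-{self._next_id}", task_id, True)
        self._next_id += 1
        ev.cached_partitions.add(f"partition-for-{task_id}")
        self.downloads += 1
        self.evaluators[ev.evaluator_id] = ev
        return ev

    def on_evaluator_failed(self, failed_id):
        # The failed evaluator is dropped; the good evaluators are kept.
        failed = self.evaluators.pop(failed_id)
        # All tasks on the surviving evaluators are stopped.
        for ev in self.evaluators.values():
            ev.task_running = False
        # A new evaluator is requested and started with the original task.
        replacement = self._start_evaluator(failed.task_id)
        # The same tasks are resubmitted to the rest of the evaluators.
        # The topology is untouched, and cached data is not downloaded again.
        for ev in self.evaluators.values():
            ev.task_running = True
        return replacement


driver = RecoveryDriver(
    ["update", "map-0", "map-1"],
    {"map-0": "update", "map-1": "update"},
)
replacement = driver.on_evaluator_failed("evaluator-1")
```

In this sketch only the replacement evaluator triggers a new download; the survivors keep their `cached_partitions`, and `topology` is never modified during recovery.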
Attachments
Issue Links
- contains
  - REEF-1325 Fix PoisonTest.TestPoisonedEvaluatorStartHanlder failure in AppVeyor [Resolved]
  - REEF-1305 Moving the communication group creation before submitting tasks and decouple evaluator/context requests from task creation [Resolved]
  - REEF-1399 Node stuck in group communication failure case [Resolved]
  - REEF-1420 Dispose IGroupCommClient/Network Service from IMRU tasks [Resolved]
  - REEF-1492 On IMRU recovery: if ResultHandler.Dispose() throws exception, IMRU Driver hangs. [Resolved]
  - REEF-1677 IMRU: Evaluators failed during WaitingForEvaluator phase don't count towards MaximumNumberOfEvaluatorFailures limit [Resolved]
  - REEF-1680 Increasing default retry count in WaitingForRegistration [Resolved]
  - REEF-1685 Complete the Job properly if update/master task is completed from running state [Resolved]
  - REEF-1683 Use default MaxRetryNumberInRecovery properly [Resolved]
  - REEF-1224 IMRU Fault Tolerance - Separate Data downloading from Task injection [Resolved]
  - REEF-1511 timeout for Task Shutdown during IMRU recovery [Resolved]
  - REEF-1549 Resolve the issue in WaitingForRegistration [Resolved]
  - REEF-1550 Clean up task exceptions in IMRU task hosts [Resolved]
  - REEF-1556 Add number of forced failures for IMRU fault tolerant testing [Resolved]
  - REEF-1452 Make PoisonException Serializable [Resolved]
  - REEF-1418 IMRU State management [Closed]
  - REEF-1304 Create tests which use .NET Poison to validate evaluator failure scenarios [Resolved]
  - REEF-1316 Adding test for resubmitting Evaluator [Resolved]
  - REEF-1317 Adding test for resubmitting tasks [Resolved]
  - REEF-1451 IMRU Fault Tolerant scenario testing [Resolved]
  - REEF-1682 Update TCP Connection config values for IMRU example and test [Resolved]
  - REEF-1366 Create tests which use .NET Poison to validate task failure scenarios [Closed]
  - REEF-1225 IMRU Fault Tolerance - Identify the failure cases in the Group Communication and events received [Resolved]
  - REEF-1226 Evaluator Fault Tolerant - Prototype [Resolved]
  - REEF-1248 Identify the scenarios that need to restart evaluators [Resolved]
  - REEF-1249 Add REEF Poison to REEF.NET [Resolved]
  - REEF-1251 IMRU Driver handlers for Fault Tolerant [Resolved]
  - REEF-1260 Adding a sample for Context Start handler [Resolved]
  - REEF-1318 TaskSubmitor - task preparation and submission in IMRU [Resolved]
  - REEF-1320 Creating default communication group in passive way [Resolved]
  - REEF-1321 Task Manager for Fault Tolerant [Resolved]
  - REEF-1322 Allow Communication Group to be removed from IGroupCommDriver [Resolved]
  - REEF-1327 Creating task states and state transitions for the IMRU Driver [Resolved]
  - REEF-1335 Create State Machine for IMRU fault tolerance [Resolved]
  - REEF-1339 Adding IInputPartition.Cache() for data download and cache [Resolved]
  - REEF-1340 Creating Context manager for Fault Tolerant [Resolved]
  - REEF-1345 Define task exceptions for IMRU Task [Resolved]
  - REEF-1378 Evaluator Manager for IMRU [Resolved]
  - REEF-1381 Allow to add Observer for ActiveContextManager [Resolved]
  - REEF-1386 Adding ICloseEvent handler for IMRU task [Resolved]
  - REEF-1392 Adding IObserver<ICloseEvent> for IMRU tasks [Resolved]
  - REEF-1404 IMRU task state Maintenance and Preservation in Evaluator for fault tolerant [Resolved]
  - REEF-1405 Cache the data in IMRU Context layer [Resolved]
  - REEF-1408 Creat IMRU functional test infrastructure and add tests for IMRU Task close handler [Resolved]
  - REEF-1466 Cancel the blocking message reading and close the task properly [Resolved]
  - REEF-1346 Throw proper exceptions in Evaluator and Context for fault tolerant [Closed]
- is blocked by
  - REEF-1423 Tasks are not disposed after they are closed [Resolved]
  - REEF-217 Evaluator reads the driver configuration (it shouldn't) [Resolved]
  - REEF-1278 IFailedEvaluator message leads to shutting down of JAVA side driver [Resolved]
  - REEF-1279 Injecting RuntimeClock in event handler creates second instance of clock [Resolved]
  - REEF-1388 Fix RunningTask to be sent for short-lived .NET tasks [Resolved]
  - REEF-1421 Transport Client inner thread is not canceled when object is disposed [Resolved]
  - REEF-1245 Upgrade Newtonsoft.Json package version to 7.0.1 [Resolved]
- relates to
  - REEF-1410 Validate Task constructor failure => FailedTask Event [Resolved]
  - REEF-1424 Validate Task StartHandler failure => FailedTask Event [Resolved]
  - REEF-1428 Validate Task Stop failure => FailedTask Event [Resolved]
  - REEF-1439 Validate Exception in spun off System.Threading.Thread => FailedEvaluator Event [Resolved]
  - REEF-1447 Validate Task close failure => FailedEvaluator Event (task has not yet finished) [Resolved]
  - REEF-1285 Fix test issue in TestSendTaskMessage [Resolved]
  - REEF-1407 Catching exceptions in group communication are implemented incorrectly [In Progress]
  - REEF-1343 Fix events received in case of evaluator failure [Resolved]
  - REEF-1425 TaskClientCodeException override user task exceptions for TaskClose [Resolved]
  - REEF-1072 Add IDriverConnection as part of evaluator configuration [Resolved]
  - REEF-1691 Should not request extra evaluators if evaluator failed at WatingForEvaluator state [Resolved]
  - REEF-1280 The message returned from failed evaluator doesn't contain the real exception message [Resolved]
  - REEF-1294 Consider race condition in Evaluator.SetRuntimeHandlers [Resolved]
  - REEF-1692 Revert the ignorance for extra Evaluators requested [Resolved]
  - REEF-1397 Revise FileSystemInputPartition and RandomInputPartition with DataCache and DataMover [Open]
  - REEF-1208 Validate that the REEF.NET Evaluator logs and reports exceptions correctly [Resolved]
  - REEF-1267 Implement IActiveContext.SubmitContextAndService [Resolved]
  - REEF-1268 Complete ServiceConfiguration in REEF.NET [Resolved]
  - REEF-1365 Define semantics of IFileDeSerializer, CopyToLocal, and their usage in FileSystemInputPartition [Open]
  - REEF-1357 Allow different caching levels for caching in IInputPartition [In Progress]
  - REEF-769 Implement IFailedEvaluator [Resolved]
  - REEF-1286 Forward .NET Exceptions from the Evaluator to the Driver [Resolved]
  - REEF-1364 C# Evaluator should attempt to send a failure message back to the Driver on an unhandled Exception [Resolved]
  - REEF-796 Do Avro (de)serialization for FailedTask in bridge [Resolved]
  - REEF-1312 Convert IMRU.Examples to test [Resolved]
  - REEF-1256 Implementing Bridge for Task close for .Net [Resolved]
  - REEF-1257 Add TaskCloseEvent handler in TaskRuntime [Resolved]
  - REEF-1258 Populate Task Exception data properly [Resolved]
  - REEF-1289 Add an EvaluatorConfiguration for the .NET Evaluator [Resolved]
  - REEF-1392 Adding IObserver<ICloseEvent> for IMRU tasks [Resolved]
  - REEF-1238 Remove obsolete .NET code in O.A.R.Wake and O.A.R.Driver [Resolved]
  - REEF-1565 Make Context accessible from task [Closed]