Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
The scenario is as follows.
The slave dies after checkpointing a terminal status update but before its ACK reaches the executor.
The recovered slave's status update manager retries the update and cleans it up after it gets an ACK from the scheduler.
When the executor re-registers after this point, it still has a pending update, but the slave cannot find the executor for this update because the task has already completed. Currently the slave forwards this update to the status update manager (SUM) anyway but never ACKs the executor, so the update stays pending on the executor side.
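
The sequence above can be illustrated with a toy model. This is only a sketch of the described behavior, not Mesos code: the `ToySlave` class, its methods, and the identifiers `u1` and `task-1` are all hypothetical names invented for illustration.

```python
class ToySlave:
    """Hypothetical stand-in for the slave plus its status update manager (SUM)."""

    def __init__(self):
        self.live_tasks = {"task-1"}   # tasks the slave still tracks
        self.forwarded_to_sum = []     # updates handed to the SUM

    def scheduler_acked(self, task_id):
        # After the scheduler ACKs the terminal update, the task is cleaned up.
        self.live_tasks.discard(task_id)

    def handle_update(self, update_id, task_id):
        # The update is forwarded to the SUM regardless of task state.
        self.forwarded_to_sum.append(update_id)
        # The executor is ACKed only if the slave can still find it via the task.
        return task_id in self.live_tasks


def simulate():
    # The executor sent terminal update u1; the slave checkpointed it and died
    # before ACKing, so u1 is still pending on the executor.
    executor_pending = {"u1": "task-1"}

    # The recovered slave retries u1, and the scheduler ACKs it.
    slave = ToySlave()
    slave.scheduler_acked("task-1")

    # The executor re-registers and resends its pending update.
    for update_id, task_id in list(executor_pending.items()):
        if slave.handle_update(update_id, task_id):
            del executor_pending[update_id]

    return slave.forwarded_to_sum, executor_pending


forwarded, still_pending = simulate()
print(forwarded, still_pending)
```

In this model the update reaches the SUM a second time, yet the executor's pending set never drains, which is the behavior this ticket reports.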