Affects Version/s: None
Fix Version/s: None
Add documentation to fault tolerant job processing
Suraj Menon made changes -
|Field||Original Value||New Value|
[ Reasons for
1. Make checkpointing asynchronous. This would reduce the delay during sync. The checkpoint function you mentioned would then be optional, used only when a task needs to wait for the checkpointing process of that superstep to finish. The design would be similar to how spilling is implemented in Hadoop.
2. Report each task's checkpointing status to BSPMaster, including the superstep count and checkpoint file. (We can live without this, since we have a naming convention for checkpoint files.)
3. Tighter integration of Checkpointer with MessageManager.
I will put up a design document for HAMA-551. ]
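As a rough illustration of item 1, asynchronous checkpointing with an optional blocking call could look like the sketch below. This is not Hama's actual Checkpointer. The class name `AsyncCheckpointer`, its methods, and the `<taskId>_<superstep>.ckpt` file naming are all hypothetical, and the messages are modelled as a plain byte array written to a local directory rather than to a distributed file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of an asynchronous checkpointer; names are assumptions. */
final class AsyncCheckpointer implements AutoCloseable {
    private final Path dir;
    private final String taskId;
    // A single background thread, so checkpoints are written in submission order.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    // Checkpoints that have been started but not yet awaited, keyed by superstep.
    private final Map<Long, CompletableFuture<Path>> pending = new ConcurrentHashMap<>();

    AsyncCheckpointer(Path dir, String taskId) {
        this.dir = dir;
        this.taskId = taskId;
    }

    /** Assumed naming convention: <taskId>_<superstep>.ckpt */
    Path fileFor(long superstep) {
        return dir.resolve(taskId + "_" + superstep + ".ckpt");
    }

    /** Starts the write in the background and returns without waiting for it. */
    CompletableFuture<Path> checkpointAsync(long superstep, byte[] messages) {
        byte[] snapshot = messages.clone(); // the caller may reuse its buffer
        CompletableFuture<Path> result = CompletableFuture.supplyAsync(() -> {
            try {
                return Files.write(fileFor(superstep), snapshot);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, writer);
        pending.put(superstep, result);
        return result;
    }

    /** Optional: blocks until the checkpoint for the given superstep is written. */
    Path awaitCheckpoint(long superstep) {
        CompletableFuture<Path> result = pending.remove(superstep);
        if (result == null) {
            throw new IllegalStateException("no checkpoint started for superstep " + superstep);
        }
        return result.join();
    }

    @Override
    public void close() {
        writer.shutdown(); // already submitted writes still run to completion
    }
}
```

A task would call `checkpointAsync` at the end of a superstep and continue into sync. Only code that has to know the checkpoint is complete would call `awaitCheckpoint`. With a fixed file naming scheme like the one assumed here, a recovering task can locate its checkpoint without asking the master, which is the convention item 2 relies on.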