[HAMA-505] Fault Tolerant Job Processing - ASF JIRA

This umbrella issue summarizes all issues related to checkpointing and task restarting to achieve fault tolerance at the job level.

relates to

HAMA-504 Cluster High Availability

#	Sub-task	Status	Assignee
1.	Integrity of checkpointed data	Resolved	Suraj Menon
2.	Make configurable checkpointing	Resolved	Suraj Menon
3.	BSPTask should periodically ping its parent.	Resolved	Suraj Menon
4.	Chainable computations for fault tolerance	Resolved	Thomas Jungblut
5.	BSP Peer should have the ability to start with a non-zero superstep from a partition of checkpointed message for that task ID, attempt ID	Resolved	Unassigned
6.	For configurable number of attempts, BSPMaster should direct groomserver to run the recovery task on failure.	Resolved	Suraj Menon
7.	Add documentation to fault tolerant job processing	Open	Unassigned
8.	Implement Checkpointing service in Hama	Resolved	Suraj Menon
9.	Handle counters during task recovery	Open	Suraj Menon
10.	Recover tasks on failure of groom server	Open	Suraj Menon
11.	Rewrite the Examples and GraphJobRunner to the Superstep API	Resolved	Unassigned
12.	Confined recovery	Open	Unassigned