Details
- Type: Umbrella
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
This umbrella summarizes all issues related to checkpointing and task restarting to achieve fault tolerance at the job level.
Issue Links
- relates to: HAMA-504 Cluster High Availability (Open)
No. | Summary | Status | Assignee
1 | Integrity of checkpointed data | Resolved | Suraj Menon
2 | Make configurable checkpointing | Resolved | Suraj Menon
3 | BSPTask should periodically ping its parent. | Resolved | Suraj Menon
4 | Chainable computations for fault tolerance | Resolved | Thomas Jungblut
5 | BSP Peer should have the ability to start with a non-zero superstep from a partition of checkpointed message for that task ID, attempt ID | Resolved | Unassigned
6 | For configurable number of attempts, BSPMaster should direct groomserver to run the recovery task on failure. | Resolved | Suraj Menon
7 | Add documentation to fault tolerant job processing | Open | Unassigned
8 | Implement Checkpointing service in Hama | Resolved | Suraj Menon
9 | Handle counters during task recovery | Open | Suraj Menon
10 | Recover tasks on failure of groom server | Open | Suraj Menon
11 | Rewrite the Examples and GraphJobRunner to the Superstep API | Resolved | Unassigned
12 | Confined recovery | Open | Unassigned