Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- 0.4, 0.5.0
- None
Description
If a checkpointing backend becomes unresponsive (e.g. a stalled NFS mount) while a series of recoveries is in progress (for instance, at startup or during failover), each checkpoint fetch operation blocks until a timeout or another kind of exception occurs, and the system then continues without recovering that PE.
We should provide a way to detect this pattern (multiple backend fetch failures in a short amount of time) and temporarily disable fetching from the backend, in order to reduce blocking when the backend becomes unresponsive.
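One possible shape for this is a small circuit breaker around backend fetches. The sketch below is only an illustration, not the implementation that resolved this issue: the class name `FetchCircuitBreaker`, its methods, and the three thresholds (failure count, detection window, disable period) are all hypothetical. The clock is injected so the behaviour can be checked without sleeping.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not an S4 class): counts recent backend fetch
 * failures and temporarily disables fetching once too many occur
 * within a short window.
 */
final class FetchCircuitBreaker {
    private final int maxFailures;      // failures that trigger disabling
    private final long windowMillis;    // "short amount of time" for detection
    private final long disableMillis;   // how long fetching stays disabled
    private final LongSupplier clock;   // source of current time in millis
    private final Deque<Long> failureTimes = new ArrayDeque<>();
    private long disabledUntil = Long.MIN_VALUE;

    FetchCircuitBreaker(int maxFailures, long windowMillis,
                        long disableMillis, LongSupplier clock) {
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
        this.disableMillis = disableMillis;
        this.clock = clock;
    }

    /** Returns false while fetching is temporarily disabled. */
    synchronized boolean allowFetch() {
        return clock.getAsLong() >= disabledUntil;
    }

    /** Call when a fetch times out or throws. */
    synchronized void recordFailure() {
        long now = clock.getAsLong();
        failureTimes.addLast(now);
        // Drop failures that fall outside the detection window.
        while (!failureTimes.isEmpty()
                && now - failureTimes.peekFirst() > windowMillis) {
            failureTimes.removeFirst();
        }
        if (failureTimes.size() >= maxFailures) {
            // Too many recent failures: stop fetching for a while.
            disabledUntil = now + disableMillis;
            failureTimes.clear();
        }
    }

    /** Call when a fetch succeeds; the backend looks healthy again. */
    synchronized void recordSuccess() {
        failureTimes.clear();
    }
}
```

Under this sketch, recovery code would call `allowFetch()` before each checkpoint fetch and, when it returns false, skip the fetch immediately and continue without recovering the PE instead of blocking on the backend. In production the clock argument could be `System::currentTimeMillis`.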
Issue Links
- Is contained by: S4-11 add checkpointing mechanism to s4-piper (Resolved)