[FLINK-9661] TTL state should support to do time shift after restoring from checkpoint (savepoint). - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels:
- auto-deprioritized-major
- auto-deprioritized-minor

Description

The initial version of TTL state appends an expiration timestamp to each state record. When the state is accessed, it checks the condition expired_timestamp <= current_time: if the condition is true, the record is expired; otherwise it is still alive. This works well in most cases, but in some cases we need to apply a time shift, otherwise using ProcessingTime can produce unexpected results. Two such cases are roughly described below.
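As a rough illustration of the check described above, here is a minimal Java sketch. The names `TtlExpiryCheck`, `TtlRecord`, and `isExpired` are hypothetical and are not Flink's actual classes; the sketch only assumes that each record carries an absolute expiration timestamp in milliseconds, as the description states.

```java
/** Hypothetical illustration of the expiration check; not Flink's actual TTL classes. */
final class TtlExpiryCheck {

    /** A state value stored together with its absolute expiration timestamp (ms). */
    static final class TtlRecord<T> {
        final T value;
        final long expirationTimestamp;

        TtlRecord(T value, long writeTime, long ttlMillis) {
            this.value = value;
            // Assumption: expiration is fixed at write time as writeTime + TTL.
            this.expirationTimestamp = writeTime + ttlMillis;
        }
    }

    /** The condition from the description: expired_timestamp <= current_time. */
    static boolean isExpired(TtlRecord<?> record, long currentTime) {
        return record.expirationTimestamp <= currentTime;
    }

    public static void main(String[] args) {
        long twoHours = 2 * 60 * 60 * 1000L;
        TtlRecord<String> record = new TtlRecord<>("v", 0L, twoHours);
        System.out.println(isExpired(record, twoHours - 1)); // still alive
        System.out.println(isExpired(record, twoHours));     // expired
    }
}
```

Because the stored timestamp is absolute, the check keeps counting wall-clock time while the job is not running, which is what leads to the two cases below.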

When restoring the job from a savepoint

For example, a user sets the TTL of the state to 2h. If they trigger a savepoint and restore the job from it after 2h (perhaps something prevented them from restoring the job quickly), then all of the restored job's previous state data is expired.

When the job spends a long time recovering from a failure

For example, many jobs are running on a YARN session cluster that is configured to store checkpoint data on a DFS. Unfortunately, the DFS hits a strange problem that sends the jobs on the cluster into a recovery-fail-recovery-fail... loop. The devs spend some time fixing the DFS issue and the jobs start working properly again, but if "system down time >= TTL", the jobs' previous state data will be expired.

To avoid the problems above, we need to apply a time shift after the job recovers from a checkpoint or savepoint. A possible approach is outlined in 6186.
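One way to picture such a time shift is sketched below. This is only an assumption about what a shift could look like, not necessarily the approach referenced above: it supposes that the snapshot records the time it was taken, and that on restore every expiration timestamp is moved forward by the downtime (restore time minus snapshot time). The names `TtlTimeShift`, `downtime`, and `shiftExpiration` are hypothetical.

```java
/** Hypothetical sketch of a restore-time shift; not Flink's actual implementation. */
final class TtlTimeShift {

    /** Time the job was not running: restore time minus snapshot time, never negative. */
    static long downtime(long snapshotTime, long restoreTime) {
        return Math.max(0L, restoreTime - snapshotTime);
    }

    /** Moves an expiration timestamp forward by the downtime. */
    static long shiftExpiration(long expirationTimestamp, long snapshotTime, long restoreTime) {
        return expirationTimestamp + downtime(snapshotTime, restoreTime);
    }

    /** Same check as in the description: expired_timestamp <= current_time. */
    static boolean isExpired(long expirationTimestamp, long currentTime) {
        return expirationTimestamp <= currentTime;
    }

    public static void main(String[] args) {
        long hour = 60 * 60 * 1000L;
        long expiration = 2 * hour;   // record written at t=0 with a 2h TTL
        long snapshotTime = 1 * hour; // savepoint triggered at t=1h
        long restoreTime = 4 * hour;  // job restored at t=4h

        // Without the shift, the record is already expired at restore time.
        System.out.println(isExpired(expiration, restoreTime));
        // With the shift, the record keeps the 1h of lifetime it had at the savepoint.
        long shifted = shiftExpiration(expiration, snapshotTime, restoreTime);
        System.out.println(isExpired(shifted, restoreTime));
    }
}
```

In this sketch the TTL effectively counts only the time during which the job is running, so both the delayed savepoint restore and the long failure recovery would leave the remaining lifetime of each record unchanged.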

People

Assignee: Unassigned
Reporter: Sihua Zhou
Votes: 0
Watchers: 3

Dates

Created: 26/Jun/18 03:39
Updated: 07/Dec/21 10:40