Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-9661

TTL state should support to do time shift after restoring from checkpoint (savepoint).

    XMLWordPrintableJSON

Details

    Description

      The initial version of the TTL-state appends the expired timestamp along the state record, and check the expired timestamp with the condition expired_timestamp <= current_time when accessing the state, if it is true then the record is expired, otherwise it is still alive. This could works pretty fine in the most cases, but in some case, we need to do time shift, otherwise it may cause some unexpected result when using the ProccessTime, I roughly describe two case as follow.

      • when restoring the job from the savepoint

      For example, the user set the TTL to 2h for the state, if he trigger a savepoint and restore the job from the savepoint after 2h(maybe some reason that delay he to restore the job quickly), then the restored job's previous state data are all expired.

      • when the job spend a long time to recover from a failure

      For example, there are many jobs running on a yarn session cluster, and the cluster configured to use the DFS to store the checkpoint data, but unfortunately, the DFS meet a strange problem which makes the jobs on the cluster begin to loop in recovery-fail-recovery-fail... the devs spend some time to address the issue of DFS and the jobs start working properly, but if the "system down time >= TTL" then the job's previous state data will be expired in this case.

      To avoid the problems as above, we need to do time shift after the job recovering from checkpoint & savepoint. A possible approach is outlined in 6186.

      Attachments

        Activity

          People

            Unassigned Unassigned
            sihuazhou Sihua Zhou
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: