Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-20752

FailureRateRestartBackoffTimeStrategy allows one less restart than configured

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • The Failure Rate Restart Strategy was allowing 1 less restart per interval as configured. Users wishing to keep the current behavior should reduce the maximum number of allowed failures per interval by 1.

    Description

      The FailureRateRestartBackoffTimeStrategy maintains a list of failure timestamps, keeping N timestamps where N is the configured of failures per interval.

      The timestamp is added when #notifyFailure() is called, and later evaluated within #canRestart().

      To determine whether a restart should be allowed we first check whether we are already storing N timestamps, and if so check whether the earliest failure still falls within the current interval. If it does, we reject the restart.

      The problem is that we check whether we have already stored exactly N timestamps. If we have exactly N timestamps, and we allow N failures per interval, then we should not reject the restart by definition. We should instead be checking whether N+1 timestamps have been stored.

      For example, let's say we allow 2 exceptions, and 2 have occurred so far. Regardless of what the timestamps are, we should still allow a restart in this case.
      Only once a third exception occurs should we be looking at the timestamps, and we should furthermore only look at the exception exceeding the allowed failure count; in this example it is the very first exception.

      I don't know why this went unnoticed for so long; the relevant tests fail rather reliably for me locally. (FailureRateRestartBackoffTimeStrategyTest, SimpleRecoveryFailureRateStrategyITBase)

       

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            chesnay Chesnay Schepler
            chesnay Chesnay Schepler
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment