  1. Flink
  2. FLINK-14949

Task cancellation can be stuck against out-of-thread error

    Details

      Description

      Task cancellation (cancelOrFailAndCancelInvokable) relies on three separate threads: TaskCanceler, TaskInterrupter, and TaskCancelerWatchdog. TaskCanceler performs the cancellation itself, TaskInterrupter periodically interrupts a task that does not react, and TaskCancelerWatchdog kills the JVM if cancellation has not finished within a certain amount of time (3 min by default). Together they ensure that cancellation either completes or is aborted, so that the task reaches a terminal state in finite time (FLINK-4715).

      However, if creating any of these asynchronous threads fails, for example because the process has run out of threads (java.lang.OutOfMemoryError: unable to create new native thread), the task has already transitioned to CANCELING, but nothing performs the cancellation and no watchdog is watching it. Currently, the JobManager does retry cancellation when an error is returned, but the next retry returns success as soon as it sees CANCELING, assuming that cancellation is in progress. The task therefore stays in CANCELING, which is a non-terminal state, and the state machine makes no further progress.
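The failure mode described above can be illustrated with a small model. This is only a sketch under stated assumptions, not Flink's actual Task code: the class StuckCancelSketch, its Task and ThreadStarter types, and the reduced three-value state enum are all hypothetical, and the failing thread creation is simulated by throwing the OutOfMemoryError directly.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the problem; not Flink's real Task class.
public class StuckCancelSketch {
    enum State { RUNNING, CANCELING, CANCELED }

    /** Stand-in for starting the cancellation-related threads. */
    interface ThreadStarter {
        void start(Runnable work);
    }

    static final class Task {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final ThreadStarter starter;

        Task(ThreadStarter starter) {
            this.starter = starter;
        }

        void cancel() {
            if (state.compareAndSet(State.RUNNING, State.CANCELING)) {
                // The state has already changed. If starting the thread throws,
                // nothing is left to drive the task out of CANCELING.
                starter.start(() -> state.set(State.CANCELED));
            }
            // Otherwise the task is already CANCELING, which is treated as
            // "cancellation in progress", so the call returns normally.
        }
    }

    public static void main(String[] args) {
        // Simulate a process that cannot create any more native threads.
        ThreadStarter outOfThreads = work -> {
            throw new OutOfMemoryError("unable to create new native thread");
        };
        Task task = new Task(outOfThreads);

        try {
            task.cancel();
        } catch (OutOfMemoryError e) {
            System.out.println("first attempt failed: " + e.getMessage());
        }

        task.cancel(); // the retry returns without error
        System.out.println("state after retry: " + task.state.get());
    }
}
```

In this model the retry reports no error, yet the state stays CANCELING because no thread was ever started to finish the cancellation or to watch it.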

      One solution: if a task has transitioned to CANCELING but then hits a fatal error or OOM (i.e., isJvmFatalOrOutOfMemoryError is true), which indicates that it could not get as far as spawning TaskCancelerWatchdog, it could immediately treat this as a fatal error (not safely cancellable) and call notifyFatalError, just as TaskCancelerWatchdog does, but eagerly and synchronously. That way, the task at least leaves the non-terminal state, and restarting the JVM also clears potentially leaked threads and memory. The same method is also invoked by failExternally, but the transition to FAILED seems less critical because FAILED is already a terminal state.
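The proposal above could look roughly like the following. Again this is a minimal sketch under assumptions rather than the actual patch: EagerFatalSketch, Task, and ThreadStarter are hypothetical names, the local isJvmFatalOrOutOfMemoryError is a simplified stand-in for the check named in the description, and a Consumer<Throwable> stands in for notifyFatalError (which in a real task manager would lead to a JVM restart).

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch of the proposed fix; not the actual Flink change.
public class EagerFatalSketch {
    enum State { RUNNING, CANCELING }

    /** Stand-in for starting the cancellation-related threads. */
    interface ThreadStarter {
        void start(Runnable work);
    }

    // Simplified stand-in for the isJvmFatalOrOutOfMemoryError check.
    static boolean isJvmFatalOrOutOfMemoryError(Throwable t) {
        return t instanceof OutOfMemoryError
                || t instanceof InternalError
                || t instanceof UnknownError;
    }

    static final class Task {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final ThreadStarter starter;
        private final Consumer<Throwable> fatalErrorHandler; // stand-in for notifyFatalError

        Task(ThreadStarter starter, Consumer<Throwable> fatalErrorHandler) {
            this.starter = starter;
            this.fatalErrorHandler = fatalErrorHandler;
        }

        void cancel() {
            if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
                return; // cancellation already requested
            }
            try {
                starter.start(() -> { /* cancellation work would run here */ });
            } catch (Throwable t) {
                if (isJvmFatalOrOutOfMemoryError(t)) {
                    // No watchdog could be started, so escalate eagerly and
                    // synchronously instead of staying in CANCELING forever.
                    fatalErrorHandler.accept(t);
                } else {
                    throw t;
                }
            }
        }
    }

    public static void main(String[] args) {
        ThreadStarter outOfThreads = work -> {
            throw new OutOfMemoryError("unable to create new native thread");
        };
        Task task = new Task(
                outOfThreads,
                t -> System.out.println("fatal error reported: " + t.getMessage()));
        task.cancel();
    }
}
```

The design point is that the escalation happens on the calling thread, so it does not depend on creating any further thread at the moment thread creation is what failed.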

      Reproducing this is straightforward: run an application that keeps creating threads in a loop, none of which ever finishes, and that has multiple tasks, so that one task triggers a failure and the others are then cancelled by a full failover. In the web UI dashboard, some tasks on a task manager where any of the cancellation-related threads failed to spawn remain stuck in CANCELING for good.

              People

              • Assignee:
                hwanju Hwanju Kim
                Reporter:
                hwanju Hwanju Kim


                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m