Flink / FLINK-17514

TaskCancelerWatchdog does not kill TaskManager

Details

    Description

      The watchdog reports a fatal error using taskManager.notifyFatalError(msg, null), which should normally terminate the TaskManager. However, the code introduced in FLINK-16225 inspects the passed exception, which is null in this case, and fails with a NullPointerException. This prevents the TaskManager from being terminated.

      Stacktrace:

      2020-05-05 09:43:01,588 ERROR org.apache.flink.runtime.taskmanager.Task                     - Task did not exit gracefully within 180 + seconds.
      2020-05-05 09:43:01,588 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor            - Task did not exit gracefully within 180 + seconds.
      2020-05-05 09:43:01,588 ERROR org.apache.flink.runtime.taskmanager.Task                     - Error in Task Cancellation Watch Dog
      java.lang.NullPointerException
      	at org.apache.flink.util.ExceptionUtils.isOutOfMemoryErrorWithMessageStartingWith(ExceptionUtils.java:186)
      	at org.apache.flink.util.ExceptionUtils.isMetaspaceOutOfMemoryError(ExceptionUtils.java:170)
      	at org.apache.flink.util.ExceptionUtils.enrichTaskManagerOutOfMemoryError(ExceptionUtils.java:144)
      	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.onFatalError(TaskManagerRunner.java:249)
      	at org.apache.flink.runtime.taskexecutor.TaskExecutor$TaskManagerActionsImpl.notifyFatalError(TaskExecutor.java:1751)
      	at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1514)
      	at java.lang.Thread.run(Thread.java:748)
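
The failure mode in the trace above can be illustrated with a minimal, standalone sketch. It assumes the check dereferences the throwable directly, which is consistent with the NullPointerException in the trace but is not copied from Flink's source. The class and method names (FatalErrorSketch, unsafeIsOomWithPrefix, safeIsOomWithPrefix) are hypothetical and are not part of Flink's API.

```java
// Hypothetical sketch; names are illustrative and not taken from Flink.
final class FatalErrorSketch {

    // Assumed failing pattern: the throwable is dereferenced without a null check,
    // so a null argument raises a NullPointerException.
    static boolean unsafeIsOomWithPrefix(Throwable t, String prefix) {
        return t.getClass() == OutOfMemoryError.class
                && t.getMessage() != null
                && t.getMessage().startsWith(prefix);
    }

    // Null-safe variant: a null throwable is simply "not an OutOfMemoryError".
    static boolean safeIsOomWithPrefix(Throwable t, String prefix) {
        return t != null
                && t.getClass() == OutOfMemoryError.class
                && t.getMessage() != null
                && t.getMessage().startsWith(prefix);
    }

    public static void main(String[] args) {
        try {
            unsafeIsOomWithPrefix(null, "Metaspace");
            System.out.println("unsafe check returned normally");
        } catch (NullPointerException e) {
            System.out.println("unsafe check threw NullPointerException");
        }
        System.out.println("safe check on null: " + safeIsOomWithPrefix(null, "Metaspace"));
    }
}
```

With a guard like this, a caller that passes null as the cause would fall through to the normal fatal-error handling instead of failing inside the exception inspection. Whether the actual fix guards in the utility method or at the call site is not stated in this report.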
      

            People

              trohrmann Till Rohrmann
              aljoscha Aljoscha Krettek