Flink / FLINK-16511

Task cancellation timeout is not effective on OOM errors

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Runtime / Task
    • Labels: None

    Description

      Under high memory pressure, the task manager shutdown on fatal errors is not reliable:

      If a task does not cooperate and cannot be canceled, and an OOM error occurs when starting the task cancellation watchdog thread, the error is not propagated correctly. The job manager retries the cancelTask() request multiple times, but the operation is stateful: the task has already switched to the CANCELING state before the watchdog thread is started, so if starting the watchdog fails, the retries do not attempt to start it again.

      Such fatal errors should automatically shut down the task manager without a retry from the job manager side.
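
      The failure mode described above can be illustrated with a minimal, self-contained sketch. It is not Flink's actual Task implementation: the class CancelWatchdogSketch, the WatchdogStarter hook, and the fatalErrorHandler are hypothetical names introduced for illustration, and the OutOfMemoryError is simulated rather than caused by real memory pressure. The sketch only assumes what the description states, namely that the state switches to CANCELING before the watchdog is started and that the caller retries the cancel request.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class CancelWatchdogSketch {

    enum State { RUNNING, CANCELING }

    /** Hypothetical hook standing in for "start the cancellation watchdog thread". */
    interface WatchdogStarter { void start(); }

    /** Simplified task: the state change happens before the watchdog is started. */
    static final class Task {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        final WatchdogStarter starter;
        final Runnable fatalErrorHandler; // hypothetical; null means "no escalation"

        Task(WatchdogStarter starter, Runnable fatalErrorHandler) {
            this.starter = starter;
            this.fatalErrorHandler = fatalErrorHandler;
        }

        void cancel() {
            // Only the first call wins the transition; every retry returns here.
            if (!state.compareAndSet(State.RUNNING, State.CANCELING)) {
                return;
            }
            try {
                starter.start();
            } catch (OutOfMemoryError e) {
                if (fatalErrorHandler == null) {
                    throw e; // surfaces as a failed request; the caller may retry
                }
                fatalErrorHandler.run(); // escalate instead of relying on a retry
            }
        }
    }

    /** Calls cancel() repeatedly, the way a retrying caller would. */
    static void cancelWithRetries(Task task, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                task.cancel();
            } catch (OutOfMemoryError e) {
                // The request failed; the loop issues the next retry.
            }
        }
    }

    /** Counts how often the watchdog start is attempted across three cancel requests. */
    static int startAttemptsWithoutEscalation() {
        AtomicInteger attempts = new AtomicInteger();
        Task task = new Task(() -> {
            attempts.incrementAndGet();
            throw new OutOfMemoryError("simulated");
        }, null);
        cancelWithRetries(task, 3);
        return attempts.get();
    }

    /** Counts how often the fatal error handler runs when the watchdog start fails. */
    static int handlerCallsWithEscalation() {
        AtomicInteger handlerCalls = new AtomicInteger();
        Task task = new Task(() -> {
            throw new OutOfMemoryError("simulated");
        }, handlerCalls::incrementAndGet);
        cancelWithRetries(task, 3);
        return handlerCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("watchdog start attempts: " + startAttemptsWithoutEscalation());
        System.out.println("fatal handler calls: " + handlerCallsWithEscalation());
    }
}
```

      In the first scenario the retries never reach the watchdog start again because the task is already CANCELING, so a task that ignores cancellation is left without a timeout. In the second scenario the failure is handed to a fatal error handler on the first attempt, which is the behavior the issue asks for.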

            People

              Assignee: Maximilian Michels (mxm)
              Reporter: Maximilian Michels (mxm)
              Votes: 0
              Watchers: 3
