FLINK-5718: Handle JVM Fatal Exceptions in Tasks

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: Local Runtime
    • Labels:
      None

      Description

      The TaskManager catches and handles all types of exceptions right now (all Throwables). The intention behind that is:

      • Many Error subclasses are recoverable for the TaskManagers, such as failure to load/link user code
      • We want to give eager notifications to the JobManager in case something in a task goes wrong.

      However, there are some exceptions which should probably simply terminate the JVM, if caught in the task thread, because they may leave the JVM in a dysfunctional limbo state:

      • OutOfMemoryError
      • InternalError
      • UnknownError
      • ZipError

      These are basically the subclasses of VirtualMachineError, except for StackOverflowError, which is recoverable: by the time the error has been thrown and the stack unwound, the JVM has usually recovered already.
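
The classification described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the class name `FatalErrorClassifier` and the method name `isJvmFatal` are hypothetical and are not taken from the Flink code base. It treats the listed error types as fatal and leaves everything else, including StackOverflowError and linkage errors from user code, to the normal failure handling.

```java
/**
 * Hypothetical helper (not Flink's actual API) that decides whether a Throwable
 * caught in a task thread should be treated as fatal for the whole JVM.
 */
final class FatalErrorClassifier {

    private FatalErrorClassifier() {}

    /**
     * Returns true for the error types listed in the issue description.
     * StackOverflowError is a VirtualMachineError too, but it is deliberately
     * not matched here because it is considered recoverable.
     */
    static boolean isJvmFatal(Throwable t) {
        return t instanceof OutOfMemoryError
                // java.util.zip.ZipError is a subclass of InternalError,
                // so this check covers it as well.
                || t instanceof InternalError
                || t instanceof UnknownError;
    }
}
```

A caller would consult the predicate inside its catch-all handler, for example `if (FatalErrorClassifier.isJvmFatal(t)) { /* terminate the JVM */ } else { /* report the failure to the JobManager */ }`.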

          Activity

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3811

          Only for PRs that are opened against a branch other than `master`. For `master` I can close it. Thanks again for your work.

          githubbot ASF GitHub Bot added a comment -

          Github user zimmermatt commented on the issue:

          https://github.com/apache/flink/pull/3811

          Done @tillrohrmann. This is my first GitHub pull request, so I didn't know I needed to do that manually.

          githubbot ASF GitHub Bot added a comment -

          Github user zimmermatt closed the pull request at:

          https://github.com/apache/flink/pull/3811

          githubbot ASF GitHub Bot added a comment -

          GitHub user zimmermatt reopened a pull request:

          https://github.com/apache/flink/pull/3811

          FLINK-5718 [core] TaskManagers exit the JVM on fatal exceptions.

          Manually applied and adapted commit dfc6fba5b9830e6a7804a6a0c9f69b36bf772730 for
          the `release-1.2` branch.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/zimmermatt/flink release-1.2

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3811.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3811


          commit d50acea7ab7d53454de761a4391159ab81dbd63c
          Author: Matt Zimmer <zimmermatt@netflix.com>
          Date: 2017-05-02T23:46:13Z

          FLINK-5718 [core] TaskManagers exit the JVM on fatal exceptions.

          Manually applied and adapted commit dfc6fba5b9830e6a7804a6a0c9f69b36bf772730 for
          the `release-1.2` branch.

          commit fb3d99002e289b667e7a1533277e90d6186751e8
          Author: Matt Zimmer <zimmermatt@netflix.com>
          Date: 2017-05-03T18:04:17Z

          Merge remote-tracking branch 'upstream/release-1.2' into release-1.2


          githubbot ASF GitHub Bot added a comment -

          Github user zimmermatt closed the pull request at:

          https://github.com/apache/flink/pull/3811

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3811

          I've merged your PR. Please close this PR, since it does not get closed automatically if the commit is not merged into `master`.

          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3811

          Thanks for your contribution @zimmermatt. Changes look good to me. Merging this PR.

          githubbot ASF GitHub Bot added a comment -

          Github user zimmermatt commented on the issue:

          https://github.com/apache/flink/pull/3811

          @tillrohrmann, this is the port of FLINK-5718 to the `release-1.2` branch I mentioned. It mostly transferred over, but I needed to make some judgement calls in `TaskManagerConfiguration`, `TaskManagerRuntimeInfo`, and `JvmExitOnFatalErrorTest`.

          Please let me know if you see anything that should be done differently.

          githubbot ASF GitHub Bot added a comment -

          GitHub user zimmermatt opened a pull request:

          https://github.com/apache/flink/pull/3811

          FLINK-5718 [core] TaskManagers exit the JVM on fatal exceptions.

          Manually applied and adapted commit dfc6fba5b9830e6a7804a6a0c9f69b36bf772730 for
          the `release-1.2` branch.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/zimmermatt/flink release-1.2

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3811.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3811


          commit d50acea7ab7d53454de761a4391159ab81dbd63c
          Author: Matt Zimmer <zimmermatt@netflix.com>
          Date: 2017-05-02T23:46:13Z

          FLINK-5718 [core] TaskManagers exit the JVM on fatal exceptions.

          Manually applied and adapted commit dfc6fba5b9830e6a7804a6a0c9f69b36bf772730 for
          the `release-1.2` branch.


          StephanEwen Stephan Ewen added a comment -

          Fixed via dfc6fba5b9830e6a7804a6a0c9f69b36bf772730

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3276

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3276

          Addressing the comment and merging this...

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3276#discussion_r100128721

          --- Diff: docs/setup/config.md ---
          @@ -86,7 +86,7 @@ The default fraction for managed memory can be adjusted using the `taskmanager.m

           - `taskmanager.memory.segment-size`: The size of memory buffers used by the memory manager and the network stack in bytes (DEFAULT: 32768 (= 32 KiBytes)).

          -- `taskmanager.memory.preallocate`: Can be either of `true` or `false`. Specifies whether task managers should allocate all managed memory when starting up. (DEFAULT: false). When `taskmanager.memory.off-heap` is set to `true`, then it is advised that this configuration is also set to `true`. If this configuration is set to `false` cleaning up of the allocated offheap memory happens only when the configured JVM parameter MaxDirectMemorySize is reached by triggering a full GC.
          +- `taskmanager.memory.preallocate`: Can be either of `true` or `false`. Specifies whether task managers should allocate all managed memory when starting up. (DEFAULT: false). When `taskmanager.memory.off-heap` is set to `true`, then it is advised that this configuration is also set to `true`. If this configuration is set to `false` cleaning up of the allocated offheap memory happens only when the configured JVM parameter MaxDirectMemorySize is reached by triggering a full GC. *Note:* For streaming setups, we highly recommend to set this value to `false` as the core state backends currently do not use the managed memory.
          --- End diff --

          That would probably be good

          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3276#discussion_r99674991

          --- Diff: docs/setup/config.md ---
          @@ -86,7 +86,7 @@ The default fraction for managed memory can be adjusted using the `taskmanager.m

           - `taskmanager.memory.segment-size`: The size of memory buffers used by the memory manager and the network stack in bytes (DEFAULT: 32768 (= 32 KiBytes)).

          -- `taskmanager.memory.preallocate`: Can be either of `true` or `false`. Specifies whether task managers should allocate all managed memory when starting up. (DEFAULT: false). When `taskmanager.memory.off-heap` is set to `true`, then it is advised that this configuration is also set to `true`. If this configuration is set to `false` cleaning up of the allocated offheap memory happens only when the configured JVM parameter MaxDirectMemorySize is reached by triggering a full GC.
          +- `taskmanager.memory.preallocate`: Can be either of `true` or `false`. Specifies whether task managers should allocate all managed memory when starting up. (DEFAULT: false). When `taskmanager.memory.off-heap` is set to `true`, then it is advised that this configuration is also set to `true`. If this configuration is set to `false` cleaning up of the allocated offheap memory happens only when the configured JVM parameter MaxDirectMemorySize is reached by triggering a full GC. *Note:* For streaming setups, we highly recommend to set this value to `false` as the core state backends currently do not use the managed memory.
          --- End diff --

          Should this warning also be added to `flink-conf.yaml`?

          githubbot ASF GitHub Bot added a comment -

          GitHub user StephanEwen opened a pull request:

          https://github.com/apache/flink/pull/3276

          FLINK-5718 [core] TaskManagers exit the JVM on fatal exceptions.

          This adds a feature requested by a user for production stability.

          The TaskManager should not attempt to handle certain exceptions, because they indicate that the JVM is in a corrupt state. When a task throws such an exception, the TaskManager forcefully and immediately exits the JVM.

          Optionally, the `OutOfMemoryError` can also be set to cause such immediate JVM termination, via the `taskmanager.jvm-exit-on-oom` config option.
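
The behavior described in this pull request could be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class `TaskFailureHandler`, the constant `EXIT_CODE`, and the injected `jvmExit` callback are all hypothetical names introduced here. The `exitOnOom` flag stands in for the value of the `taskmanager.jvm-exit-on-oom` option, and the exit action is injected so that the decision logic can be exercised without actually killing the process.

```java
import java.util.function.IntConsumer;

/** Hypothetical sketch of the exit-on-fatal-error decision; not Flink's actual class. */
final class TaskFailureHandler {

    /** Arbitrary exit code chosen for this sketch. */
    static final int EXIT_CODE = 1;

    private final boolean exitOnOom;   // stands in for taskmanager.jvm-exit-on-oom
    private final IntConsumer jvmExit; // e.g. Runtime.getRuntime()::halt outside of tests

    TaskFailureHandler(boolean exitOnOom, IntConsumer jvmExit) {
        this.exitOnOom = exitOnOom;
        this.jvmExit = jvmExit;
    }

    /**
     * Requests a JVM exit for fatal errors and returns true in that case.
     * Returns false when the failure should go through regular task failure handling.
     */
    boolean handle(Throwable t) {
        boolean fatal = t instanceof InternalError
                || t instanceof UnknownError
                || (exitOnOom && t instanceof OutOfMemoryError);
        if (fatal) {
            jvmExit.accept(EXIT_CODE);
        }
        return fatal;
    }
}
```

Outside of tests, the handler could be constructed as `new TaskFailureHandler(exitOnOom, Runtime.getRuntime()::halt)`. `Runtime.halt` terminates the JVM without running shutdown hooks, which fits the "forcefully and immediately" wording above, although whether the actual change uses `halt` is an assumption of this sketch.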

          Tests

          This adds a test that verifies the option and the actual process kill (via a spawned test process).

          Documentation

          This adds the `taskmanager.jvm-exit-on-oom` option to the `setup/config.md` docs.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/StephanEwen/incubator-flink exit_on_fatal_error

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3276.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3276


          commit 21c08817554e5a66186afa83158ca9c6ac975ba4
          Author: Stephan Ewen <sewen@apache.org>
          Date: 2017-02-06T14:52:39Z

          FLINK-5718 [core] TaskManagers exit the JVM on fatal exceptions.



            People

            • Assignee: Stephan Ewen (StephanEwen)
            • Reporter: Stephan Ewen (StephanEwen)
            • Votes: 0
            • Watchers: 2
