Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-6183

TaskMetricGroup may not be cleanup when Task.run() is never called or exits early

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.3.0
    • Fix Version/s: 1.3.0, 1.2.1
    • Component/s: Metrics
    • Labels:
      None

      Description

      The TaskMetricGroup is created when a Task is created. It is cleaned up at the end of Task.run() in the finally block. If however run() is never called due some failure between the creation and the call to run the metric group is never closed. This also means that the JobMetricGroup is never closed.

        Issue Links

          Activity

          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user zentol closed the pull request at:

          https://github.com/apache/flink/pull/3610

          Show
          githubbot ASF GitHub Bot added a comment - Github user zentol closed the pull request at: https://github.com/apache/flink/pull/3610
          Hide
          Zentol Chesnay Schepler added a comment -

          1.2: 3c63c9e01c6f1ed10a4a686f1e977fbc6e1cd9fd
          1.3: dabb0bac0f724d50dcab5b3b767f38dc5feeb407

          Show
          Zentol Chesnay Schepler added a comment - 1.2: 3c63c9e01c6f1ed10a4a686f1e977fbc6e1cd9fd 1.3: dabb0bac0f724d50dcab5b3b767f38dc5feeb407
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3611

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3611
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user rmetzger commented on the issue:

          https://github.com/apache/flink/pull/3611

          I think we can merge this

          Show
          githubbot ASF GitHub Bot added a comment - Github user rmetzger commented on the issue: https://github.com/apache/flink/pull/3611 I think we can merge this
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user rmetzger commented on the issue:

          https://github.com/apache/flink/pull/3610

          Not being an expert in this area of the code, I think this change is good to merge

          Show
          githubbot ASF GitHub Bot added a comment - Github user rmetzger commented on the issue: https://github.com/apache/flink/pull/3610 Not being an expert in this area of the code, I think this change is good to merge
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user zentol commented on the issue:

          https://github.com/apache/flink/pull/3610

          @StephanEwen I've addressed your comment. Do you have any other concerns or can i go ahead and merge this?

          Show
          githubbot ASF GitHub Bot added a comment - Github user zentol commented on the issue: https://github.com/apache/flink/pull/3610 @StephanEwen I've addressed your comment. Do you have any other concerns or can i go ahead and merge this?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user zentol commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3610#discussion_r108503328

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/TaskManagerJobMetricGroup.java —
          @@ -80,8 +80,17 @@ public TaskMetricGroup addTask(
          taskName,
          subtaskIndex,
          attemptNumber);

          • tasks.put(executionAttemptID, task);
          • return task;
            + TaskMetricGroup prior = tasks.put(executionAttemptID, task);
            + if (prior == null) {
            + return task;
              • End diff –

          yes that would work as well.

          Show
          githubbot ASF GitHub Bot added a comment - Github user zentol commented on a diff in the pull request: https://github.com/apache/flink/pull/3610#discussion_r108503328 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/TaskManagerJobMetricGroup.java — @@ -80,8 +80,17 @@ public TaskMetricGroup addTask( taskName, subtaskIndex, attemptNumber); tasks.put(executionAttemptID, task); return task; + TaskMetricGroup prior = tasks.put(executionAttemptID, task); + if (prior == null) { + return task; End diff – yes that would work as well.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on a diff in the pull request:

          https://github.com/apache/flink/pull/3610#discussion_r108478998

          — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/TaskManagerJobMetricGroup.java —
          @@ -80,8 +80,17 @@ public TaskMetricGroup addTask(
          taskName,
          subtaskIndex,
          attemptNumber);

          • tasks.put(executionAttemptID, task);
          • return task;
            + TaskMetricGroup prior = tasks.put(executionAttemptID, task);
            + if (prior == null) {
            + return task;
              • End diff –

          Can you avoid adding `closeLocally()` by simply doing a "contains()" check before creating the group?

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on a diff in the pull request: https://github.com/apache/flink/pull/3610#discussion_r108478998 — Diff: flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/TaskManagerJobMetricGroup.java — @@ -80,8 +80,17 @@ public TaskMetricGroup addTask( taskName, subtaskIndex, attemptNumber); tasks.put(executionAttemptID, task); return task; + TaskMetricGroup prior = tasks.put(executionAttemptID, task); + if (prior == null) { + return task; End diff – Can you avoid adding `closeLocally()` by simply doing a "contains()" check before creating the group?
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user zentol opened a pull request:

          https://github.com/apache/flink/pull/3611

          [backport] FLINK-6183/FLINK-6184 Prevent some NPE and unclosed metric groups

          Backport of #3610 for 1.2 .

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/zentol/flink 6183_6184_metric_task_backport

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3611.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3611


          commit 790b3ce444e10191731850bad71c35fe050d9af3
          Author: zentol <chesnay@apache.org>
          Date: 2017-03-24T18:11:58Z

          FLINK-6184 Prevent NPE in buffer metrics

          commit 13e40466ffe63783c59cc979900ba7af2d693576
          Author: zentol <chesnay@apache.org>
          Date: 2017-03-24T18:39:31Z

          FLINK-6183 [metrics] Prevent some cases of TaskMG not being closed


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user zentol opened a pull request: https://github.com/apache/flink/pull/3611 [backport] FLINK-6183 / FLINK-6184 Prevent some NPE and unclosed metric groups Backport of #3610 for 1.2 . You can merge this pull request into a Git repository by running: $ git pull https://github.com/zentol/flink 6183_6184_metric_task_backport Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3611.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3611 commit 790b3ce444e10191731850bad71c35fe050d9af3 Author: zentol <chesnay@apache.org> Date: 2017-03-24T18:11:58Z FLINK-6184 Prevent NPE in buffer metrics commit 13e40466ffe63783c59cc979900ba7af2d693576 Author: zentol <chesnay@apache.org> Date: 2017-03-24T18:39:31Z FLINK-6183 [metrics] Prevent some cases of TaskMG not being closed
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user zentol opened a pull request:

          https://github.com/apache/flink/pull/3610

          FLINK-6183/FLINK-6184 Prevent some NPE and unclosed metric groups

          This PR fixes 2 issues:

          1) It prevents some NPEs in the buffer metrics by instantiating them after the task has been registered in the NetworkEnvironment.

          2) It prevents some cases where the TaskMetricGroup would never be closed. These cases include an early exit in `Task#run()` and when 2) tasks with an identical ExecutionAttemptID are run on the same TM.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/zentol/flink 6183_6184_metric_task

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3610.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3610



          Show
          githubbot ASF GitHub Bot added a comment - GitHub user zentol opened a pull request: https://github.com/apache/flink/pull/3610 FLINK-6183 / FLINK-6184 Prevent some NPE and unclosed metric groups This PR fixes 2 issues: 1) It prevents some NPEs in the buffer metrics by instantiating them after the task has been registered in the NetworkEnvironment. 2) It prevents some cases where the TaskMetricGroup would never be closed. These cases include an early exit in `Task#run()` and when 2) tasks with an identical ExecutionAttemptID are run on the same TM. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zentol/flink 6183_6184_metric_task Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3610.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3610

            People

            • Assignee:
              Zentol Chesnay Schepler
              Reporter:
              Zentol Chesnay Schepler
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development