Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0, 1.4.0
    • Component/s: Webfrontend
    • Labels:
      None

      Description

      In cases where TaskManagers fail, the web frontend in the Job Manager starts logging the exception below every few seconds.

      I labeled this as critical because the flood of log noise makes debugging in such a situation considerably harder.

      2017-05-03 19:37:07,823 WARN  org.apache.flink.runtime.webmonitor.metrics.MetricFetcher     - Fetching metrics failed.
      akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@herman:52175/user/MetricQueryService_136f717a6b91e248282cb2937d22088c]] after [10000 ms]
              at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
              at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
              at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
              at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
              at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
              at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
              at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
              at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
              at java.lang.Thread.run(Thread.java:745)
      

          Activity

          Zentol Chesnay Schepler added a comment - edited

          I'm wondering what our options are here. We can't just disable the logging; it is possible that only the MetricQueryService (MQS) is unreachable, and that should be logged if it is the case.

          We could limit the number of log messages in a given time frame, but this would mean that an unreachable MQS might only be logged after a long time.

          Finally, we could track the unreachable status of the MQS for each TaskManager, for example in a set that contains the paths. If a request fails, the path is added to the set, and we only log something when it is newly added. Once a request succeeds, the path is removed again. The problem is that we would then need some time-based clean-up code, as the set could otherwise grow indefinitely in cases where many TMs are being replaced (and thus are never reachable again).

          Sadly, there isn't something like a TaskmanagerStatusListener interface; this would be useful to track and clean up state per TaskManager.
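
          The set-based option could look roughly like the sketch below. It is only an illustration of the idea, not code from Flink: the class name UnreachableTracker, its methods, and the expiry parameter are all hypothetical. A map from path to last-failure time stands in for the set so that entries of replaced TaskManagers, which are no longer polled, can be dropped after a timeout.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical helper (not part of Flink): remembers which query-service paths are unreachable. */
final class UnreachableTracker {
    // path -> timestamp (millis) of the most recent failed request
    private final Map<String, Long> lastFailureMillis = new HashMap<>();
    private final long expiryMillis;

    UnreachableTracker(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Records a failed request; returns true only if the path was not already marked unreachable. */
    synchronized boolean recordFailure(String path, long nowMillis) {
        return lastFailureMillis.put(path, nowMillis) == null;
    }

    /** Records a successful request, so that a later failure is reported again. */
    synchronized void recordSuccess(String path) {
        lastFailureMillis.remove(path);
    }

    /** Time-based clean-up: drops entries whose last failure is older than the expiry. */
    synchronized int expire(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = lastFailureMillis.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > expiryMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    synchronized int size() {
        return lastFailureMillis.size();
    }
}
```

          A caller would log the failure only when recordFailure returns true and would invoke expire periodically. This is exactly the extra state and clean-up logic that the comment above identifies as the cost of this option.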

          StephanEwen Stephan Ewen added a comment -

          Just logging on debug might be a reasonable first fix.
          Ideally there is one info-level log event at the first failed poll, and then debug-level logs for subsequent failed polls, but that requires state (and cleanup/expiry of that state), which is not a road we should go down, I think...
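
          As an illustration of the "just log on debug" option, here is a minimal sketch. It uses java.util.logging only to stay self-contained, which is an assumption made for the example and not the logging API the MetricFetcher itself uses; Level.FINE plays the role of DEBUG, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical sketch: the same failure message logged at two different levels. */
final class FetchFailureLogging {
    static final Logger LOG = Logger.getLogger("metric-fetcher-sketch");

    /** Before: every failed poll produces a WARNING record. */
    static void logAtWarn(Throwable cause) {
        LOG.log(Level.WARNING, "Fetching metrics failed.", cause);
    }

    /** After: the same message at FINE, java.util.logging's debug-like level. */
    static void logAtDebug(Throwable cause) {
        LOG.log(Level.FINE, "Fetching metrics failed.", cause);
    }

    /** In-memory handler, used here only to observe which records get through. */
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                records.add(record);
            }
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }
}
```

          With the logger at its usual INFO threshold, the FINE records are dropped before they reach any handler, so repeated timeouts no longer flood the log; lowering the threshold to FINE makes them visible again when someone needs them. No per-TaskManager state is required.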

          githubbot ASF GitHub Bot added a comment -

          GitHub user zentol opened a pull request:

          https://github.com/apache/flink/pull/3917

          FLINK-6440 [metrics] Downgrade fetching failure logging to DEBUG

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/zentol/flink 6440_fetcher_log

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3917.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3917


          commit 372b4c9129643c97362b731d1f2d0598ccda7744
          Author: zentol <chesnay@apache.org>
          Date: 2017-05-16T08:19:18Z

          FLINK-6440 [metrics] Downgrade fetching failure logging to DEBUG


          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3917

          +1 to this from my side. Makes logs in cases of TaskManager failures easier to search through.

          githubbot ASF GitHub Bot added a comment -

          Github user rmetzger commented on the issue:

          https://github.com/apache/flink/pull/3917

          +1

          githubbot ASF GitHub Bot added a comment -

          Github user zentol commented on the issue:

          https://github.com/apache/flink/pull/3917

          merging.

          githubbot ASF GitHub Bot added a comment -

          Github user zentol closed the pull request at:

          https://github.com/apache/flink/pull/3917

          Zentol Chesnay Schepler added a comment -

          1.3: 5569c4fafb08755ef12b7a96a173170dad883184
          1.4: e03f1b52eee7e73d00846ec0dd102da808d9d63e


            People

            • Assignee:
              Zentol Chesnay Schepler
              Reporter:
              StephanEwen Stephan Ewen
            • Votes:
              0
              Watchers:
              2