FLINK-4486

JobManager not fully running when yarn-session.sh finishes

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0, 1.1.2
    • Component/s: YARN
    • Labels:
      None

      Description

      I start a detached yarn-session.sh.
      If the Yarn cluster is very busy, the yarn-session.sh script completes before all the task slots have been allocated. As a consequence, I sometimes have a JobManager without any task slots. Over time the Yarn cluster assigns these task slots, but they are not available for the first job that is submitted.

      As a result, the first few tasks in my job fail with the error "Not enough free slots available to run the job."

      I think the desirable behavior is that yarn-session waits until the jobmanager is fully functional and capable of actually running the jobs.

      org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (Read prefix '4') -> Map (Map prefix '4') (8/10)) @ (unassigned) - [SCHEDULED] > with groupID < cd6c37df290564e603da908a8783a9bf > in sharing group < SlotSharingGroup [c0b6eff6ce93967182cdb6dfeae9359b, 8b2c3b39f3a55adf9f123243ab03c9c1, 55fb94dd8a3e5f59a10dbbf5c4925db4, 433b2e4a05a5e685b48c517249755a89, 8c74690c35454064e4815ac3756cdca2, 4b4fbd24f3483030fd852b38ff2249c1, 5e36a56ea4dece18fe5ba04352d90dc8, cd6c37df290564e603da908a8783a9bf, 64eafa845087bee70735f7250df9994f, 706a5d6fe48ae57724a00a9fce5dae8a, 7bee4297e0e839e53a153dfcbcca8624, 21b58f7d408d237540ae7b4734f81a1d, b429b1ff338d9d73677f42717cfc0dbc, cc7491db641f557c6aa8c749ebc2de62, f61cbf0ae00331f67aaf60ace78b05aa, 606f02ea9e0f4ad57f0cc0232dd70842] >. Resources available to scheduler: Number of instances=1, total number of slots=7, available slots=0
      	at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
      	at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
      	at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:306)
      	at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:454)
      	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:326)
      	at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:734)
      	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1332)
      	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1291)
      	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1291)
      	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
      	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
      	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
      	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
      	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
      	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
      	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
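
The numbers in the exception message can be checked with a little arithmetic. The following is a minimal sketch, not Flink code: the class and method names are hypothetical, the parallelism of 10 is read off the "(8/10)" subtask index, and the 7 slots per TaskManager is an assumption inferred from "Number of instances=1, total number of slots=7".

```java
// Hypothetical helper, not part of Flink: relates parallelism to slot capacity.
class SlotMath {

    // Smallest number of TaskManagers whose combined slots cover the parallelism.
    static int requiredTaskManagers(int parallelism, int slotsPerTaskManager) {
        if (parallelism <= 0 || slotsPerTaskManager <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        // Integer ceiling division.
        return (parallelism + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    // True if the slots registered so far are enough for the job's parallelism.
    static boolean canSchedule(int parallelism, int totalSlots) {
        return totalSlots >= parallelism;
    }

    public static void main(String[] args) {
        int parallelism = 10;        // assumed from "(8/10)" in the exception
        int slotsPerTaskManager = 7; // assumed from 1 instance with 7 slots

        System.out.println(requiredTaskManagers(parallelism, slotsPerTaskManager)); // prints 2
        System.out.println(canSchedule(parallelism, 7));                            // prints false
    }
}
```

Under these assumptions the job needs two TaskManagers, but only one had registered when the job was submitted, which matches the reported failure.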
      

          Activity

          mxm Maximilian Michels added a comment -

          master: ab1df63c3fd419c23631f3b55b506e6fdf3cb72f
          release-1.1: 4cdeb11854956ac6cf1189d7cfa43628fb3be328

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/2423

          till.rohrmann Till Rohrmann added a comment -

          I think that's the right way to fix the problem right now. With FLIP-6 it will probably no longer be necessary, since the JobManager will be able to allocate slots and wait for their allocation, which is currently not the case.

          githubbot ASF GitHub Bot added a comment -

          GitHub user mxm opened a pull request:

          https://github.com/apache/flink/pull/2423

          FLINK-4486 detached YarnSession: wait until cluster startup is complete

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/mxm/flink FLINK-4486

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2423.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2423


          commit 2ec30c2c25204ff04270db9d072085f85909c8be
          Author: Maximilian Michels <mxm@apache.org>
          Date: 2016-08-26T10:06:36Z

          FLINK-4486 detached YarnSession: wait until cluster startup is complete


          mxm Maximilian Michels added a comment -

          The YarnClusterClient delays checking for the startup of the cluster until actual client-side operations are made (e.g. job submission). The yarn-session checks for cluster startup when you execute it without the detached -d flag, but not in detached mode, because there is no cluster interaction afterwards. We should fix that.
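
The idea of waiting for cluster startup before the detached client returns can be sketched as a polling loop with a timeout. This is only an illustration and not the change made in the pull request: `ClusterStartupWaiter`, `waitForTaskManagers`, and the `IntSupplier` status source are hypothetical names, and a real client would query the cluster for its registered TaskManagers instead of using the counter shown in `main`.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not the actual YarnClusterClient logic.
class ClusterStartupWaiter {

    // Polls the supplier until at least `expected` TaskManagers are reported
    // or the timeout elapses. Returns true if the cluster became ready.
    static boolean waitForTaskManagers(IntSupplier registeredTaskManagers,
                                       int expected,
                                       long timeoutMillis,
                                       long pollIntervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (registeredTaskManagers.getAsInt() >= expected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before enough TaskManagers registered
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real status query: one more TaskManager per poll.
        int[] registered = {0};
        boolean ready = waitForTaskManagers(() -> ++registered[0], 3, 5_000, 10);
        System.out.println(ready ? "cluster ready" : "timed out"); // prints "cluster ready"
    }
}
```

Returning a boolean instead of throwing leaves it to the caller to decide whether a timeout should fail the session start or only produce a warning.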

          aljoscha Aljoscha Krettek added a comment -

          Or that a JobManager takes the job and blocks and waits until it gets the required resources from Yarn. I think this is already being addressed by the work on FLIP-6, if I'm not mistaken.

          Could Stephan Ewen or Till Rohrmann chime in on this?


            People

            • Assignee:
              mxm Maximilian Michels
              Reporter:
              nielsbasjes Niels Basjes
            • Votes:
              0
              Watchers:
              5
