Spark / SPARK-6954

ExecutorAllocationManager can end up requesting a negative number of executors

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.3.2, 1.4.0
    • Component/s: YARN

      Description

      I have a simple test case for dynamic allocation on YARN that fails with the following stack trace:

      15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
      java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -21 from the cluster manager. Please specify a positive number!
      	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
      	at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
      	at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
      	at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
      	at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
      	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
      	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
      	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
      	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
      	at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      

      My test is as follows:

      1. Start spark-shell with a single executor.
      2. Run a select count(*) query. The number of executors rises as input size is non-trivial.
      3. After the job finishes, the number of executors falls as most of them become idle.
      4. Rerun the same query. The request to add executors fails with the above error. The job itself continues to run with whatever executors it already has, but it never gets more executors unless the shell is closed and restarted.
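
      For concreteness, here is a minimal repro sketch for spark-shell (the table name is hypothetical; the sleep simply waits past executorIdleTimeout so idle executors get reclaimed):

      // Step 2: a non-trivial scan ramps the executor count up.
      sqlContext.sql("SELECT COUNT(*) FROM events").show()
      // Step 3: wait longer than executorIdleTimeout; idle executors are removed.
      Thread.sleep(60 * 1000)
      // Step 4: the next ramp-up computes a negative delta and fails with the
      // IllegalArgumentException shown above.
      sqlContext.sql("SELECT COUNT(*) FROM events").show()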

      Notably, this error only happens when executorIdleTimeout is set very small. For example, I can reproduce it with the following configs:

      spark.dynamicAllocation.executorIdleTimeout     5
      spark.dynamicAllocation.schedulerBacklogTimeout 5
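
      For reference, a sketch of the equivalent programmatic configuration (dynamic allocation on YARN also needs the external shuffle service, which is assumed to be enabled here):

      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.executorIdleTimeout", "5")
        .set("spark.dynamicAllocation.schedulerBacklogTimeout", "5")
        .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN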
      

      Although I can simply increase executorIdleTimeout to something like 60 seconds to avoid the error, I think this is still a bug that should be fixed.

      The root cause seems to be that numExecutorsPending accidentally becomes negative when executors are killed too aggressively (i.e., executorIdleTimeout is too small), because under that circumstance the new target number of executors can be smaller than the current number of executors. When that happens, ExecutorAllocationManager ends up trying to add a negative number of executors, which throws an exception.
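
      A minimal sketch of that arithmetic (not the actual Spark code; the variable names follow ExecutorAllocationManager, and the concrete values are chosen to match the -21 in the stack trace above):

      // Simplified bookkeeping, as in ExecutorAllocationManager:
      //   target = numExecutorsPending + executorIds.size - executorsPendingToRemove.size
      var numExecutorsPending = -22 // decremented below zero by aggressive idle kills
      val numExistingExecutors = 1  // executorIds.size after most executors went idle
      val numPendingToRemove = 0    // executorsPendingToRemove.size

      val target = numExecutorsPending + numExistingExecutors - numPendingToRemove
      // target == -21, so requestTotalExecutors(target) throws
      // "Attempted to request a negative number of executor(s) -21 ...".
      // The patch keeps these values from dropping below zero, as the
      // with_fix.png plot below shows.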

        Attachments

        1. with_fix.png (308 kB) - Cheolsoo Park
        2. without_fix.png (217 kB) - Cheolsoo Park


          Activity

          Apache Spark added a comment -

          User 'piaozhexiu' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5536

          Sandy Ryza added a comment -

          Hi Cheolsoo Park, are you running with a version of Spark that contains SPARK-6325? (1.3.0 does not.)

          Cheolsoo Park added a comment -

          Hi Sandy Ryza, thank you for the question.

          I am actually running 1.3.1-RC3, and I just confirmed that SPARK-6325 is in the commit log of my release branch.

          I updated the affects version to 1.3.1 to avoid confusion.

          Cheolsoo Park added a comment -

          I am uploading two diagrams that show how the following variables move over time with and without my patch:

          • numExecutorsPending
          • executorIds.size
          • executorsPendingToRemove.size
          • targetNumExecutors

          1. with_fix.png shows 4 consecutive runs of my query. As can be seen, targetNumExecutors and numExecutorsPending stay above zero.
          2. without_fix.png shows a single run of my query. As can be seen, targetNumExecutors and numExecutorsPending go negative after the 1st run.

          Here is how I collected the data in the source code:

          // Instrumented targetNumExecutors(): log each component of the
          // computation so the values can be plotted over time.
          private def targetNumExecutors(): Int = {
            logInfo("ZZZ " +
              numExecutorsPending + "," +
              executorIds.size + "," +
              executorsPendingToRemove.size + "," +
              (numExecutorsPending + executorIds.size - executorsPendingToRemove.size))
            numExecutorsPending + executorIds.size - executorsPendingToRemove.size
          }
          
          Apache Spark added a comment -

          User 'sryza' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5704

          Apache Spark added a comment -

          User 'sryza' has created a pull request for this issue:
          https://github.com/apache/spark/pull/5856

            People

            • Assignee: Sandy Ryza
            • Reporter: Cheolsoo Park
            • Votes: 0
            • Watchers: 6
