Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17511

Dynamic allocation race condition: Containers getting marked failed while releasing

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.0.0, 2.0.1, 2.1.0
    • 2.0.1, 2.1.0
    • Spark Core, YARN
    • None

    Description

      While trying to reach launch multiple containers in pool, if running executors count reaches or goes beyond the target running executors, the container is released and marked failed. This can cause many jobs to be marked failed causing overall job failure.

      I will have a patch up soon after completing testing.

      Typical Exception found in Driver marking the container to Failed
      java.lang.AssertionError: assertion failed
              at scala.Predef$.assert(Predef.scala:156)
              at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489)
              at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            kishorvpatil Kishor Patil
            kishorvpatil Kishor Patil
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment