SPARK-24474: Cores are left idle when there are a lot of tasks to run



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core


      I've observed an issue happening consistently when:

      • A job contains a join of two datasets
      • One dataset is much larger than the other
      • Both datasets require some processing before they are joined (a rough sketch of this job shape follows this list)
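
      The report does not include the actual job, so the following is only a rough
      sketch of the job shape described above, assuming hypothetical paths, column
      names, and transformations; the point is simply that each input needs its own
      processing stage before the join.

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions._

        object JoinShapeSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("join-shape-sketch").getOrCreate()

            // Hypothetical inputs: one large dataset (~30k tasks) and one much smaller one (~2k tasks).
            val large = spark.read.parquet("hdfs:///data/large")   // placeholder path
            val small = spark.read.parquet("hdfs:///data/small")   // placeholder path

            // Both sides need processing before the join, so Spark schedules two
            // independent shuffle-map stages that initially run in parallel.
            val largePrepped = large.filter(col("value").isNotNull)
              .groupBy("key").agg(sum("value").as("total"))
            val smallPrepped = small.filter(col("flag") === true)
              .groupBy("key").agg(count("*").as("n"))

            // Third stage: the join of the two prepared datasets.
            val joined = largePrepped.join(smallPrepped, "key")

            joined.write.mode("overwrite").parquet("hdfs:///data/out")  // placeholder output
            spark.stop()
          }
        }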

      What I have observed is:

      • Two stages are initially active, one running the processing for each dataset
        • These stages are run in parallel
        • One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks)
        • Spark allocates a similar (though not exactly equal) number of cores to each stage
      • First stage completes (for the smaller dataset)
        • Now there is only one stage running
        • It still has many tasks left (usually > 20k tasks)
        • Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
        • This continues until the second stage completes
      • Second stage completes, and third begins (the stage that actually joins the data)
        • This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200)

      Other interesting things about this:

      • It seems that when multiple stages are active and one of them finishes, its cores are not actually released to the remaining stages
      • Once all active stages are done, all cores are released to new stages
      • I can't reproduce this locally on my machine, only on a cluster with YARN enabled
      • It happens both when dynamic allocation is enabled and when it is disabled
      • The stage that is left with idle cores (referred to as the "second stage" above) has a lower Stage Id than the first stage that completes
      • This happens with spark.shuffle.service.enabled set to both true and false (a configuration sketch follows this list)
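
      For reference, a sketch of how the relevant settings could be toggled when trying
      to reproduce this; the executor counts are hypothetical (chosen to give 200 total
      cores, matching the numbers above) and are not taken from the report.

        import org.apache.spark.sql.SparkSession

        // Behaviour was reported on YARN, with dynamic allocation both on and off,
        // and with the external shuffle service both on and off.
        val spark = SparkSession.builder()
          .appName("idle-cores-repro")
          .master("yarn")                                       // only reproduced on a YARN cluster
          .config("spark.dynamicAllocation.enabled", "false")   // also reproduced with "true"
          .config("spark.shuffle.service.enabled", "false")     // also reproduced with "true"
          .config("spark.executor.instances", "50")             // hypothetical: 50 executors x 4 cores = 200 cores
          .config("spark.executor.cores", "4")
          .getOrCreate()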


              Assignee: Unassigned
              Reporter: Al M (alrocks46)