Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.2.0
- Fix Version/s: None
Description
I've observed an issue happening consistently when:
- A job contains a join of two datasets
- One dataset is much larger than the other
- Both datasets require some processing before they are joined (a minimal sketch of this job shape follows the list)
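For reference, here is a minimal sketch of that job shape. The paths, column names, and sizing are placeholders of my own, not taken from the actual job; the point is just that both sides of the join need their own map stage before the shuffle.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object JoinShapeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-shape-sketch").getOrCreate()

    // Placeholder inputs: "large" is roughly 15x the size of "small", so its
    // pre-join stage has far more tasks (e.g. 30k vs 2k in the numbers above).
    val large = spark.read.parquet("hdfs:///data/large")
    val small = spark.read.parquet("hdfs:///data/small")

    // Both sides need some processing before the join, which is what produces
    // the two independent stages that start out running in parallel.
    val largeProcessed = large.filter(col("value").isNotNull)
                              .withColumn("key", lower(col("key")))
    val smallProcessed = small.filter(col("value").isNotNull)
                              .withColumn("key", lower(col("key")))

    // The smaller dataset is still too big to broadcast, so this is a shuffle
    // join: the third stage in the description, which uses all cores.
    val joined = largeProcessed.join(smallProcessed, Seq("key"))

    joined.write.parquet("hdfs:///data/out")
    spark.stop()
  }
}
{code}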
What I have observed is:
- Two stages are initially active, running the pre-join processing for the two datasets
- These stages run in parallel
- One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks)
- Spark allocates a similar (though not exactly equal) number of cores to each stage
- The first stage (the one processing the smaller dataset) completes
- Now there is only one stage running
- It still has many tasks left (usually > 20k tasks)
- Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103; see the diagnostic sketch after this list)
- This continues until the second stage completes
- The second stage completes, and the third begins (the stage that actually joins the data)
- This stage works fine; no cores are idle (e.g. Total Cores = 200, active tasks = 200)
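To put numbers on the idle cores, I use something like the snippet below from a spark-shell on the driver. This is just a hypothetical diagnostic (the helper name and output format are my own); it prints how many tasks are currently running, overall and per active stage, so the total can be compared against the known core count (e.g. 200 above).
{code:scala}
import org.apache.spark.SparkContext

// Hypothetical diagnostic, not part of the job itself.
def logRunningTasks(sc: SparkContext): Unit = {
  val tracker = sc.statusTracker

  // Tasks currently executing across all executors.
  val runningTasks = tracker.getExecutorInfos.map(_.numRunningTasks()).sum

  // Per-stage breakdown of active vs. total tasks.
  val perStage = tracker.getActiveStageIds
    .flatMap(id => tracker.getStageInfo(id))
    .map(s => s"stage ${s.stageId()}: ${s.numActiveTasks()} active of ${s.numTasks()} tasks")
    .mkString("; ")

  println(s"running tasks = $runningTasks ($perStage)")
}
{code}
While only the long stage is running, this should report roughly half as many running tasks as there are cores (matching the 103-of-200 example above); once the join stage starts, the two numbers match again.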
Other interesting things about this:
- It seems that when multiple stages are active and one of them finishes, the completed stage does not actually release its cores to the remaining active stages
- Once all of the active stages are done, all cores are released to newly submitted stages
- I can't reproduce this locally on my machine, only on a cluster with YARN enabled
- It happens both when dynamic allocation is enabled and when it is disabled
- The stage that hangs (referred to as the "second stage" above) has a lower 'Stage Id' than the first one that completes
- This happens with spark.shuffle.service.enabled set to both true and false (an example configuration is sketched below)
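For completeness, this is the kind of configuration I run with, expressed as a SparkSession builder sketch. The executor sizing here is illustrative only (chosen to give 200 cores), and the two flags marked below were each tried with both values.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-shape-sketch")
  .master("yarn")                                      // only reproduces on YARN, not in local mode
  .config("spark.dynamicAllocation.enabled", "true")   // also happens with "false"
  .config("spark.shuffle.service.enabled", "true")     // also happens with "false"
  .config("spark.executor.instances", "50")            // illustrative: 50 executors * 4 cores = 200 cores
  .config("spark.executor.cores", "4")
  .getOrCreate()
{code}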
Attachments
Issue Links
- is related to: HDFS-13720 HDFS dataset Anti-Affinity Block Placement across all DataNodes for data local task optimization (improve Spark executor utilization & performance) [Open]