[SPARK-14327] Scheduler holds locks which cause huge scheduler delays and executor timeouts - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.6.1
Fix Version/s: None
Component/s: Scheduler, Spark Core
Labels:
- bulk-closed

Description

I have a job that grinds to a halt partway through one of its stages: throughput drops from around 300k tasks in 15 minutes to fewer than 1000 in the next hour. The driver ends up using 100% CPU on a single core (out of 4), the executors start failing to receive heartbeat responses, tasks are not scheduled, and results trickle in.

For this stage the maximum scheduler delay is 15 minutes, while the 75th percentile is 4 ms.

It appears that TaskSchedulerImpl does most of its work while holding a single scheduler-wide synchronized lock. This lock is shared between at least:

TaskSetManager.canFetchMoreResults
TaskSchedulerImpl.handleSuccessfulTask
TaskSchedulerImpl.executorHeartbeatReceived
TaskSchedulerImpl.statusUpdate
TaskSchedulerImpl.checkSpeculatableTasks

This appears to severely limit the latency and throughput of the scheduler, and causes my job to fail outright because it takes too long.
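
To illustrate the kind of contention described above, here is a minimal Java sketch. It is not Spark's code: ToyScheduler is a hypothetical stand-in, and only the two method names are borrowed from the list above. It assumes every method is guarded by the same object monitor, and measures how long a cheap heartbeat call waits while another thread holds that monitor for slow work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CoarseLockSketch {
    /** Hypothetical stand-in for a scheduler whose methods all share one monitor. */
    static class ToyScheduler {
        private int finishedTasks = 0;
        private int heartbeats = 0;

        // Slow path: keeps the monitor for holdMillis (Thread.sleep does not release it).
        synchronized void statusUpdate(long holdMillis) throws InterruptedException {
            Thread.sleep(holdMillis);
            finishedTasks++;
        }

        // Cheap path, but it still has to wait for the same monitor.
        synchronized int executorHeartbeatReceived() {
            return ++heartbeats;
        }
    }

    /** Returns how long (in ms) a heartbeat call waited while another thread held the lock. */
    static long measureHeartbeatDelayMillis(long holdMillis) {
        ToyScheduler scheduler = new ToyScheduler();
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread updater = new Thread(() -> {
            synchronized (scheduler) {      // take the monitor first (it is reentrant)
                lockHeld.countDown();       // signal that the lock is now held
                try {
                    scheduler.statusUpdate(holdMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        updater.start();

        try {
            lockHeld.await();               // wait until the updater owns the monitor
            long start = System.nanoTime();
            scheduler.executorHeartbeatReceived();   // blocks until the updater is done
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            updater.join();
            return elapsed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heartbeat waited ~" + measureHeartbeatDelayMillis(200) + " ms");
    }
}
```

In this toy setup the heartbeat itself does almost no work, yet its latency is roughly the time the other thread spends inside the lock. If the same pattern applies in the real scheduler, a long-running operation under the lock would delay heartbeat handling in the same way.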

Attachments

driver.jstack
01/Apr/16 10:24
67 kB
Chris Bannister

Issue Links

relates to

SPARK-13279 Scheduler does O(N^2) operation when adding a new task set (making it prohibitively slow for scheduling 200K tasks)

Resolved

People

Assignee: Unassigned

Reporter: Chris Bannister

Votes: 1

Watchers: 11

Dates

Created: 01/Apr/16 10:22

Updated: 17/May/20 17:47

Resolved: 21/May/19 04:16