Spark / SPARK-14327

Scheduler holds locks which cause huge scheduler delays and executor timeouts


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: Scheduler
    • Labels:

      Description

I have a job which, after a while in one of its stages, grinds to a halt: it goes from processing around 300k tasks in 15 minutes to fewer than 1000 in the next hour. The driver ends up using 100% CPU on a single core (out of 4), the executors start failing to receive heartbeat responses, tasks are not scheduled, and results trickle in.

For this stage the max scheduler delay is 15 minutes, while the 75th percentile is 4 ms.

It appears that TaskSchedulerImpl does most of its work whilst holding the global synchronised lock for the class. This synchronised lock is shared between at least:

      TaskSetManager.canFetchMoreResults
      TaskSchedulerImpl.handleSuccessfulTask
      TaskSchedulerImpl.executorHeartbeatReceived
      TaskSchedulerImpl.statusUpdate
      TaskSchedulerImpl.checkSpeculatableTasks

This appears to severely limit the latency and throughput of the scheduler, and causes my job to outright fail due to taking too long.
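The contention pattern described above can be sketched outside of Spark: when every scheduler entry point synchronizes on the same monitor, a slow call (such as a heartbeat that aggregates per-task metrics) blocks even trivial status updates behind it. A minimal, self-contained illustration in Java (the class and method names here are hypothetical stand-ins, not Spark's actual code):

```java
// Sketch of coarse-grained scheduler locking (hypothetical names, not Spark's code).
// Every entry point synchronizes on the same monitor, so a slow heartbeat
// handler blocks even trivial status updates.
class CoarseLockScheduler {
    private boolean inHeartbeat = false;

    // Stand-in for executorHeartbeatReceived: holds the class lock for a long time.
    synchronized void heartbeatReceived() throws InterruptedException {
        inHeartbeat = true;
        Thread.sleep(100); // pretend to aggregate per-task metrics
        inHeartbeat = false;
    }

    // Stand-in for statusUpdate: cheap, but must wait for the same lock.
    synchronized boolean statusUpdate() {
        // Because both methods share one monitor, this can never observe a
        // heartbeat in progress -- it simply queues up behind it instead.
        return !inHeartbeat;
    }
}

public class LockContentionDemo {
    public static void main(String[] args) throws Exception {
        CoarseLockScheduler sched = new CoarseLockScheduler();
        Thread slow = new Thread(() -> {
            try { sched.heartbeatReceived(); } catch (InterruptedException ignored) {}
        });
        slow.start();
        Thread.sleep(10); // let the heartbeat grab the lock first
        long t0 = System.nanoTime();
        boolean consistent = sched.statusUpdate(); // blocks until the heartbeat releases the lock
        long waitedMs = (System.nanoTime() - t0) / 1_000_000;
        slow.join();
        System.out.println("statusUpdate waited ~" + waitedMs + " ms, consistent=" + consistent);
    }
}
```

With the driver handling hundreds of thousands of task status updates per stage, any long-running call under that single monitor serializes all of them, which matches the observed single-core 100% CPU and 15-minute scheduler delays.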

        Attachments

1. driver.jstack (67 kB, Chris Bannister)


People

• Assignee:
  Unassigned
• Reporter:
  Zariel (Chris Bannister)
• Votes:
  1
• Watchers:
  12
