Spark / SPARK-30297

Executor heartbeat expiration causes the app to hang forever

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.0, 2.4.4
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Background

      If an executor is lost on the network without having sent a RST or close packet to the driver, the driver cannot detect the loss through a network disconnection. In that case the driver can only detect that the executor is dead when its heartbeat expires.
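      The detection path described above can be illustrated with a minimal sketch. This is not Spark's HeartbeatReceiver; the class `HeartbeatTracker`, its methods, and the 120-second timeout are hypothetical and chosen only for illustration. The tracker records the last heartbeat time per executor and reports executors that have been silent for longer than the timeout.

```python
class HeartbeatTracker:
    """Toy driver-side tracker (hypothetical, not Spark's HeartbeatReceiver)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # executor id -> time of last heartbeat

    def heartbeat(self, executor_id, now):
        # Record that the executor reported in at time `now`.
        self.last_seen[executor_id] = now

    def expire(self, now):
        """Return and forget executors whose last heartbeat is too old."""
        dead = sorted(e for e, t in self.last_seen.items()
                      if now - t > self.timeout_s)
        for e in dead:
            del self.last_seen[e]
        return dead


tracker = HeartbeatTracker(timeout_s=120)  # illustrative timeout
tracker.heartbeat("exec-1", now=0)
tracker.heartbeat("exec-2", now=0)
tracker.heartbeat("exec-2", now=100)  # exec-1 goes silent after t=0
expired = tracker.expire(now=130)
print(expired)
```

      In this sketch an expired executor is also dropped from the tracker, so a second check no longer reports it. No connection state is consulted at any point, which mirrors the situation where no RST or close packet ever arrives.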

      Problems

      Heartbeat expiration is processed as follows:

      1. The executor's heartbeat expires as described above.
      2. HeartbeatReceiver calls the scheduler's executor-lost handling to reschedule the tasks that were running on this executor.
      3. HeartbeatReceiver kills the executor.
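      The ordering of the three steps above can be sketched as follows. The function `handle_expired_executor` and its callback parameters are hypothetical names, not Spark APIs; the sketch only records the order in which the steps run.

```python
def handle_expired_executor(executor_id, on_executor_lost, kill_executor, log):
    """Apply the three steps in order for one expired executor (toy model)."""
    log.append(("expired", executor_id))  # step 1: heartbeat has expired
    on_executor_lost(executor_id)         # step 2: tasks become schedulable again
    kill_executor(executor_id)            # step 3: the executor is killed


events = []
handle_expired_executor(
    "exec-1",
    on_executor_lost=lambda e: events.append(("executor_lost", e)),
    kill_executor=lambda e: events.append(("kill", e)),
    log=events,
)
print(events)
```

      The point of the sketch is that the tasks are released for rescheduling in step 2, before the executor is dealt with in step 3.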

      The tasks from the dead executor can be rescheduled onto the same dead executor if rescheduling happens before the executor has been removed from executorBackend. The driver then sends a launch-task message to this executor again, but the executor never responds, and the driver cannot detect this through the heartbeat because the executor is already lost on the network. As a result, the tasks rescheduled onto the lost executor never finish, and the app hangs forever.
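      The race described above can be reproduced in a small simulation. Everything here is an assumption made for illustration: `ToyBackend`, `reschedule`, and `run` are hypothetical, and the policy of always picking the first registered executor is a deliberate simplification, not Spark's scheduling logic.

```python
class ToyBackend:
    """Hypothetical stand-in for the driver's executor registry."""

    def __init__(self, executors):
        self.registered = list(executors)  # executors the driver still offers
        self.reachable = set(executors)    # executors that can actually answer

    def launch(self, task, executor_id):
        # A launch message sent to an unreachable executor is silently dropped.
        return executor_id in self.reachable


def reschedule(pending, backend):
    """Assign each pending task to the first registered executor."""
    finished, stuck = [], []
    for task in pending:
        target = backend.registered[0]
        (finished if backend.launch(task, target) else stuck).append(task)
    return finished, stuck


def run(remove_before_reschedule):
    backend = ToyBackend(["exec-1", "exec-2"])
    backend.reachable.discard("exec-1")  # exec-1 vanishes from the network
    pending = ["task-0", "task-1"]       # tasks freed by the executor-lost step
    if remove_before_reschedule:
        backend.registered.remove("exec-1")
    return reschedule(pending, backend)


racy = run(remove_before_reschedule=False)
safe = run(remove_before_reschedule=True)
print(racy, safe)
```

      When the lost executor is still registered at rescheduling time, both tasks land on it and stay stuck; when it is removed first, both tasks run on the remaining executor.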

            People

              Assignee: Unassigned
              Reporter: yuhaiyang haiyangyu
              Votes: 0
              Watchers: 1