Spark / SPARK-20869

Master should clear failed apps when worker down

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      In `Master.removeWorker`, the master clears executor and driver state but does not clear app state. App state is cleared only when the master receives `UnregisterApplication` or when `onDisconnect` is called. The first happens when the driver shuts down gracefully; the second is triggered when `netty`'s `channelInactive` is called (that is, when the channel is closed). Neither path handles a network partition between the master and a worker.

      To reproduce, follow the steps in SPARK-19900 and see the screenshots: when worker1 is partitioned from the master, the app `app-xxx-000` is still shown as running, when it should be finished because worker1 is down.
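
The gap described above can be illustrated with a small toy model. This is only a sketch in Python, not Spark's Scala code: `ToyMaster`, its dictionaries, the `clear_apps` flag, and the `demo` helper are all hypothetical names, and it assumes the app's driver was running on the lost worker, which the issue does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy stand-in for the standalone Master's bookkeeping (not Spark code)."""
    executors: dict = field(default_factory=dict)  # executor id -> worker id
    drivers: dict = field(default_factory=dict)    # driver id -> worker id
    apps: dict = field(default_factory=dict)       # app id -> {"driver": id, "state": str}

    def remove_worker(self, worker_id, clear_apps=False):
        # Behavior described in the issue: executor and driver state is cleared.
        self.executors = {e: w for e, w in self.executors.items() if w != worker_id}
        lost_drivers = {d for d, w in self.drivers.items() if w == worker_id}
        self.drivers = {d: w for d, w in self.drivers.items() if w != worker_id}
        # Hypothetical improvement: also finish apps whose driver was on that worker.
        if clear_apps:
            for app in self.apps.values():
                if app["driver"] in lost_drivers and app["state"] == "RUNNING":
                    app["state"] = "FINISHED"


def demo(clear_apps):
    """Lose worker1 and report the resulting state of app-xxx-000."""
    master = ToyMaster()
    master.drivers["driver-0"] = "worker1"
    master.executors["exec-0"] = "worker1"
    master.apps["app-xxx-000"] = {"driver": "driver-0", "state": "RUNNING"}
    master.remove_worker("worker1", clear_apps=clear_apps)
    return master.apps["app-xxx-000"]["state"]
```

In this model, `demo(False)` leaves the app in `RUNNING`, which mirrors the reported symptom, while `demo(True)` moves it to `FINISHED`. Whether the real fix belongs in `Master.removeWorker` or in a separate cleanup path is left open here.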

      cc CodingCat

          People

            Assignee: Unassigned
            Reporter: lyc Li Yichao
            Votes: 0
            Watchers: 1

              Time Tracking

                Original Estimate: 2h
                Remaining Estimate: 2h
                Time Spent: Not Specified