SPARK-6353

Handling fatal executor errors and decommissioning datanodes

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Spark Core, YARN
    • Labels: None

Description

We've been running into "No space left on device" errors from time to time lately. The job fails after exhausting its retries, and obviously in such cases retrying won't help.

Sure, the problem lies in the datanodes themselves, but I'm wondering whether the Spark driver could handle it: decommission the problematic datanode before retrying, and perhaps allocate a replacement node when dynamic allocation is enabled.
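
For context, dynamic allocation is turned on with the settings below; both properties are existing Spark configuration (dynamic allocation requires the external shuffle service, so that executors can be removed without losing shuffle files):

    import org.apache.spark.SparkConf

    // Enable dynamic executor allocation; the external shuffle service is
    // required so executors can be released without losing their shuffle files.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")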

I think there needs to be a class of fatal errors that cannot be recovered by retries, and it would be best if Spark handled them gracefully; a sketch of such a classification follows.
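
A minimal sketch of what such a classifier could look like (the FatalErrorClassifier helper and its message patterns are hypothetical, not existing Spark API):

    import java.io.IOException

    // Hypothetical sketch, not actual Spark code: classify failures that
    // retrying cannot fix, so the driver could exclude the bad node instead
    // of re-scheduling the task onto it.
    object FatalErrorClassifier {

      // Messages indicating a node-level disk problem rather than a
      // transient failure. Illustrative, not exhaustive.
      private val fatalPatterns = Seq(
        "No space left on device",
        "Read-only file system"
      )

      /** True when the error is unrecoverable and the node should be excluded. */
      def isUnrecoverable(t: Throwable): Boolean = t match {
        case e: IOException =>
          val msg = Option(e.getMessage).getOrElse("")
          fatalPatterns.exists(p => msg.contains(p))
        case _ => false
      }
    }

A driver-side retry path could consult isUnrecoverable before re-scheduling: exclude the failing executor's host when it returns true, and request a replacement executor if dynamic allocation is on.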

Jianshi

People

    Assignee: Unassigned
    Reporter: Jianshi Huang (huangjs)
    Votes: 0
    Watchers: 1
