Spark / SPARK-14209

Application failure during preemption.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: 1.6.3, 2.0.1, 2.1.0
    • Component/s: Block Manager
    • Labels: None
    • Environment: Spark on YARN

      Description

      We have a fair-sharing cluster set up, including the external shuffle service. When a new job arrives, existing jobs are successfully preempted down to fit.
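      For readers trying to reproduce a similar setup, the Spark-side settings might look roughly like the sketch below. The report does not list the actual configuration, so the values here are assumptions: `spark.shuffle.service.enabled` reflects the external shuffle service mentioned above, and `spark.task.maxFailures` is shown at Spark's documented default. The helper `to_submit_flags` is hypothetical and only renders the properties as `spark-submit` arguments.

```python
# Assumed settings for illustration; the reporter's real configuration is not given.
ASSUMED_CONF = {
    "spark.shuffle.service.enabled": "true",  # external shuffle service, per the report
    "spark.task.maxFailures": "4",            # Spark's documented default
}


def to_submit_flags(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


if __name__ == "__main__":
    print(" ".join(to_submit_flags(ASSUMED_CONF)))
```

      Preemption itself is configured on the YARN side (fair scheduler), which this sketch does not cover.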

      A spate of these messages arrives:
      ExecutorLostFailure (executor 48 exited unrelated to the running tasks) Reason: Container container_1458935819920_0019_01_000143 on host: ip-10-12-46-235.us-west-2.compute.internal was preempted.

      This seems fine. The problem is that soon afterward our whole application fails because it cannot fetch blocks from the preempted containers:

      org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 locations. Most recent failure cause:
      Caused by: java.io.IOException: Failed to connect to ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
      Caused by: java.net.ConnectException: Connection refused: ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681

      Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
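      One way to check whether the fetch failures really target preempted hosts is to correlate the two kinds of messages in the driver log. The following is a minimal sketch, not part of the original report: the function name `fetch_failures_on_preempted_hosts` is hypothetical, and the regular expressions are modeled only on the two message shapes quoted above, so they may need adjusting for other Spark versions.

```python
import re

# Patterns modeled on the two message shapes quoted in this report (assumption:
# other Spark versions may word these messages differently).
PREEMPTED = re.compile(r"Container (\S+) on host: (\S+) was preempted")
CONNECT_FAILED = re.compile(r"Failed to connect to ([^/\s]+)/")


def fetch_failures_on_preempted_hosts(log_lines):
    """Return hosts that were preempted and later refused a connection."""
    preempted_hosts = set()
    hits = []
    for line in log_lines:
        match = PREEMPTED.search(line)
        if match:
            preempted_hosts.add(match.group(2))
            continue
        match = CONNECT_FAILED.search(line)
        if match:
            host = match.group(1)
            # Only report failures that follow a preemption of the same host.
            if host in preempted_hosts and host not in hits:
                hits.append(host)
    return hits


if __name__ == "__main__":
    sample = [
        "ExecutorLostFailure (executor 48 exited unrelated to the running tasks) "
        "Reason: Container container_1458935819920_0019_01_000143 on host: "
        "ip-10-12-46-235.us-west-2.compute.internal was preempted.",
        "Caused by: java.io.IOException: Failed to connect to "
        "ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681",
    ]
    print(fetch_failures_on_preempted_hosts(sample))
```

      If every failing host appears in the result, the failures are consistent with fetches being directed at containers that no longer exist.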

      Spark does not attempt to recreate these blocks; the tasks simply fail over and over until the maximum number of task attempts (spark.task.maxFailures) is reached.

      It appears to me that there is some fault in the way preempted containers are being handled - shouldn't these blocks be recreated on demand?


              People

              • Assignee: Josh Rosen (joshrosen)
              • Reporter: Miles Crawford (milesc)
              • Votes: 1
              • Watchers: 9
