Spark / SPARK-22976

Worker cleanup can remove running driver directories

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2
    • Fix Version/s: 2.3.0
    • Component/s: Deploy, Spark Core
    • Labels:
      None

      Description

      Spark Standalone worker cleanup finds directories to remove with a listFiles call.

      This includes both application directories and driver directories from cluster mode submitted applications.

      A directory is considered not to belong to a running app if the worker has no executor with a matching application ID.

      https://github.com/apache/spark/blob/v2.2.1/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L432

            val appIds = executors.values.map(_.appId).toSet
            val isAppStillRunning = appIds.contains(appIdFromDir)
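
A minimal sketch of the scan described above, assuming the work directory simply contains one subdirectory per application or driver. The object name `CleanupScanSketch`, the helper `removalCandidates`, and the application ID `app-20180105234830-0000` are hypothetical and not taken from Spark. The sketch ignores any other conditions the real cleanup applies.

```scala
import java.io.File
import java.nio.file.Files

object CleanupScanSketch {
  // Returns the names of subdirectories of workDir that are not in runningIds.
  def removalCandidates(workDir: File, runningIds: Set[String]): Seq[String] = {
    // listFiles returns null when workDir is not a readable directory.
    val entries = Option(workDir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    entries.filter(_.isDirectory).map(_.getName).filterNot(runningIds.contains).sorted
  }

  // Builds a throwaway work directory holding one app directory and one driver directory.
  def demo(): Seq[String] = {
    val workDir = Files.createTempDirectory("worker-sketch").toFile
    new File(workDir, "app-20180105234830-0000").mkdir() // hypothetical app ID
    new File(workDir, "driver-20180105234824-0000").mkdir()
    // Only the app ID is known from executors, so the driver directory is a candidate.
    removalCandidates(workDir, Set("app-20180105234830-0000"))
  }
}
```

Calling `CleanupScanSketch.demo()` lists the driver directory as a removal candidate, because only executor application IDs are consulted.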
      

      If a driver has been started on a node, but all of the executors are on other nodes, the worker running the driver will always assume that the driver directory is not part of a running app.

      Consider a two-node Spark cluster with Worker A and Worker B, where each node has a single core available. We submit our application in cluster deploy mode; the driver begins running on Worker A while the executor starts on Worker B.

      A cleanup is triggered on Worker A, which finds the following directory:

      /var/lib/spark/worker/driver-20180105234824-0000
      

      Worker A checks its executor list and finds no matching entries, since it has no executors for this application. Worker A then removes the directory even though the driver may still be actively running.
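
The two-worker scenario can be reproduced in a small model. This is a sketch only: `ExecutorInfo`, `isDirStillRunning`, and the application ID are hypothetical stand-ins, and the check copies the v2.2.1 logic quoted earlier rather than calling into Spark.

```scala
object CleanupBugSketch {
  // Hypothetical stand-in for the worker's per-executor record.
  final case class ExecutorInfo(appId: String)

  // Mirrors the quoted check: only executor app IDs count as running.
  def isDirStillRunning(dirName: String, executors: Map[String, ExecutorInfo]): Boolean = {
    val appIds = executors.values.map(_.appId).toSet
    appIds.contains(dirName)
  }

  val appId = "app-20180105234830-0000" // hypothetical application ID
  val driverDir = "driver-20180105234824-0000"

  // Worker A hosts the driver only; Worker B hosts the single executor.
  val workerAExecutors = Map.empty[String, ExecutorInfo]
  val workerBExecutors = Map("0" -> ExecutorInfo(appId))

  // Worker B recognizes its app directory, but Worker A does not recognize the driver directory.
  val appDirKeptOnB = isDirStillRunning(appId, workerBExecutors)
  val driverDirKeptOnA = isDirStillRunning(driverDir, workerAExecutors)
}
```

Here `driverDirKeptOnA` evaluates to `false`, which is the condition under which the worker would delete the directory of a running driver.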

      I think this could be fixed by changing line 432 to:

            val appIds = executors.values.map(_.appId).toSet ++ drivers.values.map(_.driverId)
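
A sketch of how the proposed check might behave, assuming the worker tracks drivers in a map keyed by driver ID. `ExecutorInfo`, `DriverInfo`, `isDirStillRunning`, and the stale directory name are hypothetical and only illustrate the idea; this is not the merged patch.

```scala
object CleanupFixSketch {
  // Hypothetical stand-ins for the worker's executor and driver records.
  final case class ExecutorInfo(appId: String)
  final case class DriverInfo(driverId: String)

  // A directory counts as in use if its name matches a local executor's
  // app ID or a local driver's ID.
  def isDirStillRunning(
      dirName: String,
      executors: Map[String, ExecutorInfo],
      drivers: Map[String, DriverInfo]): Boolean = {
    val activeIds = executors.values.map(_.appId).toSet ++ drivers.values.map(_.driverId)
    activeIds.contains(dirName)
  }

  val driverId = "driver-20180105234824-0000"

  // Worker A from the example: one running driver and no executors.
  val workerAExecutors = Map.empty[String, ExecutorInfo]
  val workerADrivers = Map(driverId -> DriverInfo(driverId))

  val driverDirKept = isDirStillRunning(driverId, workerAExecutors, workerADrivers)
  // A directory with no matching executor or driver is still eligible for cleanup.
  val staleDirKept = isDirStillRunning("driver-00000000000000-9999", workerAExecutors, workerADrivers)
}
```

With driver IDs included, `driverDirKept` is `true` while `staleDirKept` remains `false`, so directories of finished drivers can still be cleaned up.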
      

      I'll run a test and submit a PR soon.

            People

            • Assignee: Russell Spitzer (rspitzer)
            • Reporter: Russell Spitzer (rspitzer)