Spark / SPARK-4498

Standalone Master can fail to recognize completed/failed applications


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.1.1, 1.2.0
    • Fix Version/s: 1.1.2, 1.2.0
    • Component/s: Deploy, Spark Core
    • Labels: None

    Description

      We observe the Spark standalone master failing to detect that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports such applications as Running even though the driver process terminated long ago. The master continually spawns one executor for the application: it boots, times out trying to connect to the driver, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives on until the master (and workers) are restarted.

      This happens to many jobs at once, all right around the same time, two or three times a day, and they all get stuck. Before and after this "blip", applications start, get resources, finish, and are marked as finished properly. The "blip" is mostly conjecture on my part; I have no hard evidence that it exists other than the pattern I see in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip (more than 1.9 hours ago) still have active drivers. All the applications started right at 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers.

      Deploy mode:

      • PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised

      Other factoids:

      • In most places, we call sc.stop() explicitly before shutting down our driver process
      • Here's the sum total of spark configuration options we don't set to the default:
            "spark.cores.max": 30
            "spark.eventLog.dir": "hdfs://nn.shopify.com:8020/var/spark/event-logs"
            "spark.eventLog.enabled": true
            "spark.executor.memory": "7g"
            "spark.hadoop.fs.defaultFS": "hdfs://nn.shopify.com:8020/"
            "spark.io.compression.codec": "lzf"
            "spark.ui.killEnabled": true
        
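        For reference, a minimal sketch of how one of our PySpark drivers applies these non-default options and shuts down (the master URL here is a placeholder, not our actual cluster address):

        ```python
        from pyspark import SparkConf, SparkContext

        # Hypothetical driver sketch: sets the non-default options listed above,
        # then stops the context explicitly before the driver process exits.
        conf = (SparkConf()
                .setMaster("spark://master.example.com:7077")  # placeholder URL
                .set("spark.cores.max", "30")
                .set("spark.eventLog.dir", "hdfs://nn.shopify.com:8020/var/spark/event-logs")
                .set("spark.eventLog.enabled", "true")
                .set("spark.executor.memory", "7g")
                .set("spark.hadoop.fs.defaultFS", "hdfs://nn.shopify.com:8020/")
                .set("spark.io.compression.codec", "lzf")
                .set("spark.ui.killEnabled", "true"))

        sc = SparkContext(conf=conf)
        try:
            pass  # job logic goes here
        finally:
            sc.stop()  # explicit shutdown, as described above
        ```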
      • The exception the executors die with is this:
        14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
        14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
        14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
        14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
        14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
        14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
        14/11/19 19:42:37 INFO Remoting: Starting remoting
        14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@dn13.chi.shopify.com:37682]
        14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682.
        14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849
        14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
        	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
        	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
        	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
        	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
        Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:415)
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        	... 4 more
        Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        	at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        	at scala.concurrent.Await$.result(package.scala:107)
        	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
        	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
        	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
        	... 7 more
        

      Cluster history:

      • We run Spark versions built from apache/spark#master snapshots. We did not observe this behaviour on 7eb9cbc273d758522e787fcb2ef68ef65911475f (sorry it's so old), but now observe it on c6e0c2ab1c29c184a9302d23ad75e4ccd8060242. We can try new versions to assist debugging.

      Attachments

        1. one-applications-master-logs.txt
          76 kB
          Harry Brundage
        2. all-master-logs-around-blip.txt
          5.92 MB
          Harry Brundage

        Activity

          People

            Assignee: markhamstra Mark Hamstra
            Reporter: airhorns Harry Brundage
            Votes: 1
            Watchers: 9
