TinkerPop / TINKERPOP-1271

SparkContext should be restarted if Killed and using Persistent Context

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 3.2.0-incubating, 3.1.2-incubating
    • Fix Version/s: 3.2.4
    • Component/s: hadoop
    • Labels: None

    Description

      If the persisted Spark Context is killed by the user via the Spark UI, or is terminated because of some other error, the Gremlin Console/Server is left with a stopped Spark Context. This could be caught and the Spark Context recreated. Oddly enough, if you simply wait, the context will "reset" itself, or possibly get GC'd out of the system, and everything works again.
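
A rough sketch of the "catch it and recreate the context" idea, written without Spark on the classpath so it runs on its own. `PersistentContextHolder` and `FakeContext` are hypothetical names used only for illustration; they are not TinkerPop or Spark classes. In a real setup the stopped check would presumably delegate to something like the context's own stopped flag, but that mapping is an assumption here, not what the eventual fix does.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/**
 * Hypothetical holder for one long-lived ("persistent") context.
 * It hands out the cached context while it is usable and builds a
 * fresh one when the cached context is found to be stopped.
 */
final class PersistentContextHolder<C> {
    private final Supplier<C> factory;     // creates a fresh context
    private final Predicate<C> isStopped;  // true if the context can no longer be used
    private C current;                     // cached context, null until first use

    PersistentContextHolder(Supplier<C> factory, Predicate<C> isStopped) {
        this.factory = factory;
        this.isStopped = isStopped;
    }

    /** Returns a usable context, recreating the cached one if it was stopped. */
    synchronized C get() {
        if (current == null || isStopped.test(current)) {
            current = factory.get();
        }
        return current;
    }
}

/** Stand-in for a real context (assumption: the real one exposes a stopped check). */
final class FakeContext {
    private boolean stopped;

    void stop() { stopped = true; }

    boolean isStopped() { return stopped; }
}
```

With a holder like this, each job submission would call `get()` instead of reusing a stored reference, so a context killed from the Master UI would be replaced on the next traversal rather than failing with `IllegalStateException`.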

      Repro

      gremlin> g.V().count()
      WARN  org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer  - HADOOP_GREMLIN_LIBS is not set -- proceeding regardless
      ==>6
      gremlin> ERROR org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend  - Application has been killed. Reason: Master removed our application: KILLED
      ERROR org.apache.spark.scheduler.TaskSchedulerImpl  - Lost executor 0 on 10.150.0.180: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
      // Driver has been killed here via the Master UI
      
      gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
      ==>hadoopgraph[gryoinputformat->gryooutputformat]
      gremlin> g.V().count()
      WARN  org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer  - HADOOP_GREMLIN_LIBS is not set -- proceeding regardless
      java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
      This stopped SparkContext was created at:
      
      org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
      org.apache.tinkerpop.gremlin.spark.structure.Spark.create(Spark.java:53)
      org.apache.tinkerpop.gremlin.spark.structure.io.SparkContextStorage.open(SparkContextStorage.java:60)
      org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.lambda$submitWithExecutor$1(SparkGraphComputer.java:122)
      java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      java.lang.Thread.run(Thread.java:745)
      
      The currently active SparkContext was created at:
      
      org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
      org.apache.tinkerpop.gremlin.spark.structure.Spark.create(Spark.java:53)
      org.apache.tinkerpop.gremlin.spark.structure.io.SparkContextStorage.open(SparkContextStorage.java:60)
      org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.lambda$submitWithExecutor$1(SparkGraphComputer.java:122)
      java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      java.lang.Thread.run(Thread.java:745)
      

      Full trace from TP

      	at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106)
      	at org.apache.spark.SparkContext$$anonfun$newAPIHadoopRDD$1.apply(SparkContext.scala:1130)
      	at org.apache.spark.SparkContext$$anonfun$newAPIHadoopRDD$1.apply(SparkContext.scala:1129)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
      	at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
      	at org.apache.spark.SparkContext.newAPIHadoopRDD(SparkContext.scala:1129)
      	at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopRDD(JavaSparkContext.scala:507)
      	at org.apache.tinkerpop.gremlin.spark.structure.io.InputFormatRDD.readGraphRDD(InputFormatRDD.java:42)
      	at org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.lambda$submitWithExecutor$1(SparkGraphComputer.java:195)
      	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      

      If we wait a certain amount of time, for some reason everything starts working again:

      ERROR org.apache.spark.rpc.netty.Inbox  - Ignoring error
      org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
      	at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:438)
      	at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:124)
      	at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264)
      	at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:172)
      	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
      	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
      	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
      	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      WARN  org.apache.spark.rpc.netty.NettyRpcEnv  - Ignored message: true
      WARN  org.apache.spark.deploy.client.AppClient$ClientEndpoint  - Connection to rspitzer-rmbp15.local:7077 failed; waiting for master to reconnect...
      WARN  org.apache.spark.deploy.client.AppClient$ClientEndpoint  - Connection to rspitzer-rmbp15.local:7077 failed; waiting for master to reconnect...
      gremlin> g.V().count()
      WARN  org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer  - HADOOP_GREMLIN_LIBS is not set -- proceeding regardless
      ==>6
      

            People

              Assignee: okram Marko A. Rodriguez
              Reporter: rspitzer Russell Spitzer