[SPARK-32411] GPU Cluster Fail


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: PySpark, Web UI
    • Labels: None
    • Environment: I have an Apache Spark 3.0 cluster consisting of machines with multiple NVIDIA GPUs, and I connect my Jupyter notebook to the cluster using PySpark.

    Description

      I'm having a difficult time getting a GPU cluster started on Apache Spark 3.0. It was hard to find documentation on this, but I stumbled on an NVIDIA GitHub page for RAPIDS that suggested the following additional edits to spark-defaults.conf:

      spark.task.resource.gpu.amount 0.25
      spark.executor.resource.gpu.discoveryScript ./usr/local/spark/getGpusResources.sh

      I have an Apache Spark 3.0 cluster consisting of machines with multiple NVIDIA GPUs, and I connect my Jupyter notebook to the cluster using PySpark; however, this results in the following error:

      Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
      : org.apache.spark.SparkException: You must specify an amount for gpu
      	at org.apache.spark.resource.ResourceUtils$.$anonfun$parseResourceRequest$1(ResourceUtils.scala:142)
      	at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
      	at org.apache.spark.resource.ResourceUtils$.parseResourceRequest(ResourceUtils.scala:142)
      	at org.apache.spark.resource.ResourceUtils$.$anonfun$parseAllResourceRequests$1(ResourceUtils.scala:159)
      	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
      	at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
      	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
      	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
      	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
      	at org.apache.spark.resource.ResourceUtils$.parseAllResourceRequests(ResourceUtils.scala:159)
      	at org.apache.spark.SparkContext$.checkResourcesPerTask$1(SparkContext.scala:2773)
      	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2884)
      	at org.apache.spark.SparkContext.<init>(SparkContext.scala:528)
      	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      	at py4j.Gateway.invoke(Gateway.java:238)
      	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
      	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
      	at py4j.GatewayConnection.run(GatewayConnection.java:238)
      	at java.lang.Thread.run(Thread.java:748)
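
      For context, this exception is raised in ResourceUtils.parseResourceRequest when an executor resource named gpu has been requested (here implied by the discoveryScript entry) without a matching spark.executor.resource.gpu.amount. A minimal spark-defaults.conf sketch that satisfies that check might look like the following; the amounts and the discovery-script path (assumed here to be /usr/local/spark/getGpusResources.sh rather than the ./usr/... form above) are illustrative, not taken from this report:

      spark.executor.resource.gpu.amount           1
      spark.task.resource.gpu.amount               0.25
      spark.executor.resource.gpu.discoveryScript  /usr/local/spark/getGpusResources.sh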
      

      After this, I tried adding another line to the conf per the instructions, which results in no errors; however, when I log in to the Web UI at localhost:8080, under Running Applications, the state remains WAITING.

      spark.task.resource.gpu.amount                  2
      spark.executor.resource.gpu.discoveryScript    ./usr/local/spark/getGpusResources.sh
      spark.executor.resource.gpu.amount              1
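
      One thing worth noting about this second configuration, as an observation rather than the confirmed resolution of this ticket: the per-task request (spark.task.resource.gpu.amount 2) is larger than the per-executor amount (1), so no single executor could ever satisfy a task, and in standalone mode the application will also sit in WAITING if the workers themselves do not advertise any GPUs to the master. A sketch of a self-consistent setup, with assumed amounts and paths:

      # spark-defaults.conf on the application side (assumed values)
      spark.executor.resource.gpu.amount           1
      spark.task.resource.gpu.amount               0.25
      spark.executor.resource.gpu.discoveryScript  /usr/local/spark/getGpusResources.sh

      # standalone worker side, e.g. in the worker's spark-defaults.conf
      # (amount assumed to match the GPUs actually present on each machine)
      spark.worker.resource.gpu.amount             2
      spark.worker.resource.gpu.discoveryScript    /usr/local/spark/getGpusResources.sh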
      

       


          People

            Assignee: Unassigned
            Reporter: Vinh Tran (vinhdiesal)
            Votes: 0
            Watchers: 5
