Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.1, 0.8.2, 0.9.0
Fix Version/s: None
Environment:
Cloudera/CDH 6.1
Spark 2.4
Hadoop 3.0
Zeppelin 0.8.2 (built from the latest merged pull request)
Description
Hello,
YARN cluster mode was introduced in `0.8.0`, and the failure to find ZeppelinContext was fixed in `0.8.1`. However, I am having trouble accessing any JAR in order to `import` it inside my notebook.
I have a CDH cluster where everything works in deployMode `client`, but the moment I switch to `cluster`, so that the driver is no longer on the same machine as the Zeppelin server, it can't find the packages.
Working configs
Inside interpreter:
master: yarn
spark.submit.deployMode: client
Inside `zeppelin-env.sh`:
export ZEPPELIN_IMPERSONATE_SPARK_PROXY_USER=false
export ZEPPELIN_IMPERSONATE_CMD='sudo -H -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
export PYTHONPATH=/opt/cloudera/parcels/Anaconda/envs/py36/bin/python3
export SPARK_SUBMIT_OPTIONS="--jars hdfs:///user/maziyar/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar"
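As a sanity check that the `--jars` path itself resolves, I can run something like this in a `%spark` paragraph (a minimal sketch using the Hadoop client bundled with Spark; `sc` is the SparkContext Zeppelin provides, and the path is the one from my config above):

// Sanity check: does the jar referenced in SPARK_SUBMIT_OPTIONS exist on HDFS?
import org.apache.hadoop.fs.{FileSystem, Path}

val fs  = FileSystem.get(sc.hadoopConfiguration)
val jar = new Path("hdfs:///user/maziyar/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar")
println(s"jar exists on HDFS: ${fs.exists(jar)}")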
Since the JAR is already on HDFS, switching to `cluster` should be as simple as changing `spark.submit.deployMode` to `cluster`. However, doing that results in:
import org.graphframes._
<console>:23: error: object graphframes is not a member of package org
import org.graphframes._
I can see my JAR in the Spark UI under `spark.yarn.dist.jars` and `spark.yarn.secondary.jars` in both cluster and client mode.
In client mode, `sc.jars` returns:
res2: Seq[String] = List(file:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar)
However, in `cluster` mode the same command returns an empty list. I thought maybe there is something extra or missing in the Zeppelin Spark interpreter that prevents the JAR from being used in cluster mode.
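To compare the two modes from inside the notebook, I use a paragraph like the one below (a diagnostic sketch of my own; it only reads the same properties that show up in the Spark UI dumps that follow):

// Dump the jar-related Spark properties in whichever deploy mode is active.
// sc is the SparkContext Zeppelin injects into %spark paragraphs.
val conf = sc.getConf
println(s"spark.submit.deployMode = ${conf.get("spark.submit.deployMode", "<not set>")}")
println(s"sc.jars                 = ${sc.jars.mkString(", ")}")
println(s"spark.yarn.dist.jars    = ${conf.get("spark.yarn.dist.jars", "<not set>")}")
println(s"spark.repl.local.jars   = ${conf.get("spark.repl.local.jars", "<not set>")}")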
This is how Spark UI reports my JAR in `client` mode:
spark.repl.local.jars file:/tmp/spark-3aadfe3c-8821-4dfe-875b-744c2e35a95a/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
spark.yarn.dist.jars hdfs://hadoop-master-1:8020/user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
spark.yarn.secondary.jars graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
sun.java.command org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.executor.memory=5g --conf spark.driver.memory=8g --conf spark.driver.cores=4 --conf spark.yarn.isPython=true --conf spark.driver.extraClassPath=:/opt/zeppelin-0.8.2-new/interpreter/spark/*:/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/lib/*::/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/classes:/opt/zeppelin-0.8.2-new/zeppelin-interpreter/target/test-classes:/opt/zeppelin-0.8.2-new/zeppelin-zengine/target/test-classes:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar --conf spark.useHiveContext=true --conf spark.app.name=Zeppelin --conf spark.executor.cores=5 --conf spark.submit.deployMode=client --conf spark.dynamicAllocation.maxExecutors=50 --conf spark.dynamicAllocation.initialExecutors=1 --conf spark.dynamicAllocation.enabled=true --conf spark.driver.extraJavaOptions= -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///opt/zeppelin-0.8.2-new/conf/log4j.properties -Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark-mpanahi-zeppelin-hadoop-gateway.log --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer --jars hdfs:///user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
This is how Spark UI reports my JAR in `cluster` mode (same configs as I mentioned above):
spark.repl.local.jars (this field does not exist in cluster mode)
spark.yarn.dist.jars hdfs://hadoop-master-1:8020/user/mpanahi/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
spark.yarn.secondary.jars graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
sun.java.command org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer --jar file:/opt/zeppelin-0.8.2-new/interpreter/spark/spark-interpreter-0.8.2-SNAPSHOT.jar --arg 134.158.74.122 --arg 46130 --arg : --properties-file /yarn/nm/usercache/mpanahi/appcache/application_1547731772080_0077/container_1547731772080_0077_01_000001/__spark_conf__/__spark_conf__.properties
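My reading of the two dumps (I may be wrong): in client mode Spark materializes the `--jars` entries locally for the REPL, hence `spark.repl.local.jars`, while in cluster mode the ApplicationMaster starts `RemoteInterpreterServer` with only the interpreter JAR, so the REPL classpath never sees graphframes. A crude way to confirm the class is simply absent from the driver classpath (a sketch; `org.graphframes.GraphFrame` is the entry class of the assembly):

// Check whether the graphframes classes can be loaded by the driver at all.
try {
  Class.forName("org.graphframes.GraphFrame")
  println("graphframes IS on the driver classpath")
} catch {
  case _: ClassNotFoundException => println("graphframes is NOT on the driver classpath")
}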
UPDATE: In Zeppelin 0.9.0, if I run the following paragraph at the beginning, not only is this JAR accessible, but so are all the JARs passed via `--jars` inside `zeppelin-env.sh`! If I don't do this, it fails as I described before.
%spark.conf
spark.app.name multivac
spark.jars hdfs:///user/maziyar/jars/zeppelin/graphframes/graphframes-assembly-0.7.0-spark2.3-SNAPSHOT.jar
I could somewhat understand graphframes becoming available, even though it was already in `--jars`, but the fact that this made the rest of my JARs available as well suggests there is something here that pushes the others into the cluster too.
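After running that `%spark.conf` paragraph and restarting the interpreter, a quick check like this (same sketch style as above) confirms it:

// After the %spark.conf workaround, spark.jars is set and the import resolves.
println(sc.getConf.get("spark.jars", "<not set>"))
import org.graphframes._  // now works in cluster mode, along with the other --jars entries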
Thank you.