Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.4.4
- Fix Version/s: None
Description
When spark-submit is run with an image built from the "hadoop free" Spark distribution plus user-provided Hadoop, it fails on Kubernetes: the Hadoop libraries are not on Spark's classpath.
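For reference, the submission looked roughly like this (a sketch; the master URL, app name, instance count, and example application are placeholders, not my exact command):
{code:bash}
bin/spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --name pi \
  --conf spark.kubernetes.container.image=spark-py:2.4.4-without-hadoop \
  --conf spark.executor.instances=2 \
  local:///opt/spark/examples/src/main/python/pi.py
{code}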
I downloaded Spark pre-built with user-provided Apache Hadoop.
I created a docker image using [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh].
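Roughly like this (a sketch; <repo> is a placeholder for the registry/repository, and -p selects the bundled PySpark Dockerfile):
{code:bash}
# Build the spark and spark-py images from the "hadoop free" distribution root.
./bin/docker-image-tool.sh -r <repo> -t 2.4.4-without-hadoop \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
{code}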
Based on this image (2.4.4-without-hadoop) I created another one with the following Dockerfile:
{code}
FROM spark-py:2.4.4-without-hadoop
ENV SPARK_HOME=/opt/spark/

# This is needed for newer kubernetes versions
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar $SPARK_HOME/jars

COPY spark-env.sh /opt/spark/conf/spark-env.sh
RUN chmod +x /opt/spark/conf/spark-env.sh

RUN wget -qO- https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz | tar xz -C /opt/
ENV HADOOP_HOME=/opt/hadoop-3.2.1
ENV PATH=${HADOOP_HOME}/bin:${PATH}
{code}
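and built it with a plain docker build (the tag here is illustrative, not from my original setup):
{code:bash}
# Build the derived image from the directory containing the Dockerfile
# and spark-env.sh; the tag is a placeholder.
docker build -t spark-py:2.4.4-hadoop-3.2.1 .
{code}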
Contents of spark-env.sh:
{code:bash}
#!/usr/bin/env bash
export SPARK_DIST_CLASSPATH=$(hadoop classpath):$HADOOP_HOME/share/hadoop/tools/lib/*
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
{code}
spark-submit run with an image created this way fails, since spark-env.sh is overwritten by the conf volume created when the pod starts (spark-submit on Kubernetes mounts a ConfigMap with the submission's configuration at $SPARK_HOME/conf, shadowing anything baked into the image at that path).
As a quick workaround I tried modifying the entrypoint script to source spark-env.sh during startup, after moving spark-env.sh to a different directory.
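Roughly like this (a sketch of my modification, not the stock entrypoint; /opt/spark-env.sh is where I relocated the file, as seen in the executor log below):
{code:bash}
# Added near the top of kubernetes/dockerfiles/spark/entrypoint.sh:
# source the relocated spark-env.sh so SPARK_DIST_CLASSPATH and
# LD_LIBRARY_PATH are set before the driver/executor command is built.
if [ -f /opt/spark-env.sh ]; then
  source /opt/spark-env.sh
fi
{code}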
The driver starts without issues in this setup; however, even though SPARK_DIST_CLASSPATH is set, the executor is constantly failing:
{code}
PS C:\Sandbox\projekty\roboticdrive-analytics\components\docker-images\spark-rda> kubectl logs rda-script-1571835692837-exec-12
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/ash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/ash ']'
+ source /opt/spark-env.sh
+++ hadoop classpath
++ export 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoo
++ SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
++ export LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
++ LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
++ echo 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*
++ echo LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
+ SPARK_K8S_CMD=executor
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark//jars/*'
+ env
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java -Xms3g -Xmx3g -cp ':/opt/spark//jars/*' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@rda-script-1571835692837-driver-svc.default.svc:7078 --executor-id 12 --cores 1 --app-id spark-33382c27389c4b289d79c06d5f631819 --hostname 10.244.2.24
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:186)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 3 more
{code}
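Note that the executor JVM is launched with -cp ':/opt/spark//jars/*' only: entrypoint.sh builds SPARK_CLASSPATH from ${SPARK_HOME}/jars alone, so the exported SPARK_DIST_CLASSPATH never reaches the JVM classpath, hence the NoClassDefFoundError. A sketch of the kind of change in entrypoint.sh that would address this (not necessarily the exact upstream patch):
{code:bash}
# Before building CMD: append the user-provided Hadoop jars from
# SPARK_DIST_CLASSPATH (if set) to the classpath used to launch the JVM.
if [ -n "$SPARK_DIST_CLASSPATH" ]; then
  SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"
fi
{code}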
Issue Links
- is duplicated by:
  - SPARK-29882 SPARK on Kubernetes is Broken for SPARK with no Hadoop release (Resolved)
  - SPARK-33271 Hadoop Free Build Setup for Spark 2.4.7 on Kubernetes (Closed)
- links to