Description
To repro the problem:
- Whitelist some local path using livy.file.local-dir-whitelist.
- Use yarn-cluster mode.
- Submit a job through Livy with the files parameter referencing a local file that exists only on the Livy server node, not on the cluster nodes (a sketch of such a request follows this list).
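For concreteness, a minimal sketch of such a submission using the standard Livy REST API; the host and port are hypothetical, the /tmp/a path matches the stack trace below, and Livy is assumed to already be configured for yarn-cluster mode in livy.conf.

# livy.conf (assumption: whitelist the directory holding the file)
livy.file.local-dir-whitelist = /tmp

# Create a session whose files list references a file that exists only on the Livy server node
curl -X POST http://livy-server:8998/sessions \
  -H 'Content-Type: application/json' \
  -d '{"kind": "spark", "files": ["file:/tmp/a"]}'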
The job will fail, because SparkContext tries to find the local file on the driver node, not on the node running spark-submit.
Error:
java.io.FileNotFoundException: Added file file:/tmp/a does not exist.
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1388)
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1364)
    at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:491)
    at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:491)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:491)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2305)
    at com.cloudera.livy.repl.SparkInterpreter$$anonfun$start$1.apply(SparkInterpreter.scala:123)
    at com.cloudera.livy.repl.SparkInterpreter$$anonfun$start$1.apply(SparkInterpreter.scala:87)
    at com.cloudera.livy.repl.SparkInterpreter.restoreContextClassLoader(SparkInterpreter.scala:369)
    at com.cloudera.livy.repl.SparkInterpreter.start(SparkInterpreter.scala:87)
    at com.cloudera.livy.repl.Session$$anonfun$1.apply(Session.scala:63)
    at com.cloudera.livy.repl.Session$$anonfun$1.apply(Session.scala:61)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
We didn't see this problem with Livy 0.1. In 0.2, the files parameter is no longer mapped to --files in spark-submit; instead it is set as the SparkConf property spark.files. spark-submit resolves local files passed via --files on the node running spark-submit, whereas spark.files is resolved on the driver node. Hence the difference.
I did the following experiment to confirm the difference between --files and spark.files.
First, run spark-submit directly with --files referencing the additional file. This works fine.
Then, run spark-submit with --conf "spark.files=xxx" referencing the same file. This fails with the same error message seen in Livy (both invocations are sketched below).
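To make the experiment concrete, the two invocations look roughly like this; the application jar and class names are placeholders, and /tmp/a is the file from the stack trace above.

# Works: --files is resolved on the node running spark-submit
spark-submit --master yarn-cluster --files /tmp/a \
  --class com.example.MyApp my-app.jar

# Fails with the FileNotFoundException above: spark.files is resolved by the driver
spark-submit --master yarn-cluster --conf "spark.files=/tmp/a" \
  --class com.example.MyApp my-app.jar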
The problem seems to be that --conf "spark.files=xxx" is not equivalent to --files in spark-submit, and when users use the files parameter in Livy, they expect it to behave like --files in spark-submit. This needs to be fixed.