Details
Bug
Status: Resolved
Major
Resolution: Fixed
0.7.0
Zeppelin server from a month old master snapshot
Description
We saw the exception stack below when the Zeppelin server tried to start a new interpreter process, for example the Spark interpreter. It was really hard to debug, and the only way to capture the real root cause was to add
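LOG="/tmp/interpreter.sh-$$.log"   # one log file per interpreter.sh process
date >> "$LOG"
set -x                             # trace every command that runs
exec >> "$LOG"                     # send stdout to the log file
exec 2>&1                          # send stderr to the same place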
to the $zeppelinhome/bin/interpreter.sh file, so that all stdout and stderr from interpreter.sh would go to that log file.
That showed the real problem:
Exception in thread "main" org.apache.spark.SparkException: Keytab file: /home/<username>/.kt does not exist
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
    ...
while all other Zeppelin logs and the note output showed only a misleading "Connection refused" (see the stack below):
ERROR [2017-01-18 16:54:38,533] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:1645) - Error
org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:232)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:400)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:105)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:316)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    ...
The issue might be that when interpreter.sh is started and exits right away -
https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L121
that early exit is not captured anywhere. The only sign you'll see on the Zeppelin side is "Connection refused", because Zeppelin cannot connect to the new interpreter process. We saw several different root causes (the spark-submit error above about the missing keytab file is just one of them), and every time we had to add tracing to interpreter.sh to capture the real problem.
We think there are two possible ways to improve this:
1) capture the fact that interpreter.sh bailed out (and don't try to connect in https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L132, since that can only produce the expected "Connection refused")
2) if point 1) isn't possible for some reason (although I don't see why that would be), at least capture the errors produced by interpreter.sh, so that the error stack in the Zeppelin log files and the output of the paragraph that kicked off the interpreter start would contain some meaningful information.
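To illustrate both ideas outside of Zeppelin, here is a rough shell sketch. It is not how RemoteInterpreterManagedProcess works or should be patched. The function name launch_and_check, the variables LAUNCHED_PID and LAUNCHED_LOG, and the fixed grace period are all made up for this example. It starts a command, waits briefly, and if the command has already exited it reports the exit status and captured output instead of going on to a connect attempt.

```shell
#!/bin/sh
# Hypothetical helper (not part of Zeppelin): start a command and, if it
# dies during a short grace period, surface its own output instead of
# letting the caller run into "Connection refused".
# Usage: launch_and_check GRACE_SECONDS command [args...]
launch_and_check() {
    grace="$1"; shift
    LAUNCHED_LOG="$(mktemp)" || return 2

    # Start the child in the background, capturing stdout and stderr.
    "$@" > "$LAUNCHED_LOG" 2>&1 &
    LAUNCHED_PID=$!

    # Give the child a moment to fail fast (missing file, bad config, ...).
    sleep "$grace"

    if kill -0 "$LAUNCHED_PID" 2>/dev/null; then
        # Still running: only now would the caller attempt to connect.
        echo "process $LAUNCHED_PID is running, output in $LAUNCHED_LOG"
        return 0
    fi

    # Child already exited: collect its status and show what it printed.
    status=0
    wait "$LAUNCHED_PID" || status=$?
    echo "process exited early with status $status; its output was:" >&2
    cat "$LAUNCHED_LOG" >&2
    rm -f "$LAUNCHED_LOG"
    return 1
}
```

For example, `launch_and_check 1 sh -c 'echo "simulated startup failure" >&2; exit 3'` returns 1 and prints the simulated message along with status 3, while `launch_and_check 1 sleep 30` returns 0 and leaves the process running. A real fix would more likely poll for the process exit and the listening port together rather than sleep for a fixed time.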