ZEPPELIN-1984

Zeppelin Server doesn't catch all exceptions when launching a new interpreter process

    Description

      We saw the exception stack below when the Zeppelin server tried to start a new interpreter process, for example the Spark interpreter. It was really hard to debug, and the only way to capture the real root cause was to add

      # Temporary tracing: send all stdout and stderr of this script to a per-process log file
      LOG="/tmp/interpreter.sh-$$.log"
      date >> "$LOG"
      set -x
      exec >> "$LOG"
      exec 2>&1
      

      to the $zeppelinhome/bin/interpreter.sh file, so that all stdout and stderr from interpreter.sh would go to that log file.
      That showed the real problem:

      Exception in thread "main" org.apache.spark.SparkException: Keytab file: /home/<username>/.kt does not exist
              at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555)
              at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
      ...
      

      while all other Zeppelin logs and the note output showed a misleading "Connection refused" (see the stack below):

      ERROR [2017-01-18 16:54:38,533] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:1645) - Error
      org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
              at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:232)
              at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:400)
              at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:105)
              at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:316)
              at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
              at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      ...
      

      The issue might be that interpreter.sh is started here and then exits right away:
      https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L121
      This early exit is not captured anywhere. The only sign on the Zeppelin side is "Connection refused", because Zeppelin cannot connect to the new interpreter process. We saw several different root causes (the spark-submit error above about the missing keytab file is just one of them), and every time we had to add tracing to interpreter.sh to capture the real problem.

      We think there are two possible ways to improve this:
      1) Capture the fact that interpreter.sh bails out, and don't try to connect in https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L132 since that will only produce the expected "Connection refused".
      2) If point 1) isn't possible for some reason (although I don't see why that would be), at least capture the errors produced by interpreter.sh, so that the error stack in the Zeppelin log files and the output of the paragraph that kicked off the interpreter start contain some meaningful information.
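
      To illustrate both points, here is a minimal sketch using only the plain JDK ProcessBuilder API. It is not Zeppelin's actual launch code: the class InterpreterLaunchCheck, the method launchOrFail, and the grace period parameter are hypothetical names made up for this example. The idea is to wait briefly after starting the launcher script; if it has already exited, fail with its exit code and captured output instead of going on to a connect attempt that can only end in "Connection refused".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, NOT part of Zeppelin: launches a command and fails fast
// if it exits during a grace period, surfacing whatever it printed.
class InterpreterLaunchCheck {

  static Process launchOrFail(List<String> command, long graceMillis)
      throws IOException, InterruptedException {
    ProcessBuilder builder = new ProcessBuilder(command);
    builder.redirectErrorStream(true); // merge stderr into stdout
    Process process = builder.start();

    // Drain output on a background thread so the child never blocks on a full pipe.
    // Simplification: the buffer is unbounded, which a real implementation should cap.
    ByteArrayOutputStream captured = new ByteArrayOutputStream();
    Thread drainer = new Thread(() -> {
      try (InputStream in = process.getInputStream()) {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
          captured.write(buf, 0, n);
        }
      } catch (IOException ignored) {
        // Stream closed; keep what was captured so far.
      }
    });
    drainer.setDaemon(true);
    drainer.start();

    // waitFor returns true if the process exited within the grace period.
    if (process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
      drainer.join(1000); // give the drainer a moment to read the remaining output
      String output = new String(captured.toByteArray(), StandardCharsets.UTF_8);
      throw new IOException("Launcher exited early with code "
          + process.exitValue() + ". Output:\n" + output);
    }
    return process; // still running: the caller can now try to connect
  }

  public static void main(String[] args) throws Exception {
    // Demo with a stand-in command that fails the way the keytab case above did.
    try {
      launchOrFail(Arrays.asList("sh", "-c",
          "echo 'Keytab file does not exist' >&2; exit 1"), 2000);
    } catch (IOException e) {
      System.err.println(e.getMessage());
    }
  }
}
```

      With something along these lines, the exception that reaches the Zeppelin log and the paragraph output would carry the launcher's own error text. A fixed grace period is only a heuristic: a launcher that fails after the period has elapsed would still need a separate exit check before or during the connect retries.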

          People

            Tagar Ruslan Dautkhanov