ZEPPELIN-3435

Interpreter timeout lifecycle leads to interpreter process orphans

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: zeppelin-zengine
    • Labels: None

    Description

      We have configured our interpreters to time out after 60 minutes. From time to time an interpreter is not closed properly: the remote interpreter process stays alive. This behavior is non-deterministic.

      When the timeout is reached, only the following is logged:

      INFO [2018-04-27 13:06:44,329] ({Timer-0} TimeoutLifecycleManager.java[run]:49) - InterpreterGroup spark:shared_process is timeout.
      INFO [2018-04-27 13:06:44,329] ({Timer-0} ManagedInterpreterGroup.java[close]:89) - Close InterpreterGroup: spark:shared_process
      INFO [2018-04-27 13:06:44,329] ({Timer-0} ManagedInterpreterGroup.java[close]:100) - Close Session: 2D8VRV5M6 for interpreter setting: spark
      WARN [2018-04-27 13:06:44,329] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkInterpreter
      WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkSqlInterpreter
      WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.DepInterpreter
      WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.PySparkInterpreter
      WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.IPySparkInterpreter
      WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkRInterpreter
      INFO [2018-04-27 13:06:44,330] ({Timer-0} ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup: spark:shared_process as all the sessions are closed
      

      On a successful shutdown we additionally see the following log entries, which are missing when the bug occurs:

      ...
      INFO [2018-04-27 13:11:20,485] ({Timer-0} ManagedInterpreterGroup.java[close]:108) - Kill RemoteInterpreterProcess
      INFO [2018-04-27 13:11:20,485] ({Timer-0} RemoteInterpreterManagedProcess.java[stop]:220) - Kill interpreter process
      ERROR [2018-04-27 13:11:20,692] ({Thread-71907} RemoteInterpreterEventPoller.java[run]:257) - Can not get RemoteInterpreterEvent because it is shutdown.
      ERROR [2018-04-27 13:11:20,692] ({pool-30-thread-1} AppendOutputRunner.java[run]:68) - Wait for OutputBuffer queue interrupted: null
      WARN [2018-04-27 13:11:22,991] ({Timer-0} RemoteInterpreterManagedProcess.java[stop]:230) - ignore the exception when shutting down
      INFO [2018-04-27 13:11:22,993] ({Timer-0} RemoteInterpreterManagedProcess.java[stop]:238) - Remote process terminated
      
      

      So when the bug occurs, line 108 of ManagedInterpreterGroup.java is never reached.
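      The difference between the two log excerpts above can be checked mechanically. The following is a minimal sketch, not part of Zeppelin: the class name CloseWithoutKillScanner and its method are hypothetical, and it assumes that a healthy close always logs "Kill RemoteInterpreterProcess" before the next "Close InterpreterGroup:" line appears (plausible here because both are written by the same Timer-0 thread). It reports every group whose close sequence has no kill line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not Zeppelin code): finds close sequences that lack a kill line. */
final class CloseWithoutKillScanner {
    // Both marker strings are copied from the log excerpts in this report.
    private static final String CLOSE_MARKER = "Close InterpreterGroup: ";
    private static final String KILL_MARKER = "Kill RemoteInterpreterProcess";

    /** Returns the ids of groups that were closed but never followed by a kill line. */
    static List<String> findSuspectGroups(List<String> logLines) {
        List<String> suspects = new ArrayList<>();
        String pendingGroup = null; // most recently closed group still waiting for its kill line
        for (String line : logLines) {
            int idx = line.indexOf(CLOSE_MARKER);
            if (idx >= 0) {
                // Assumption: a new close starting means the previous one never killed its process.
                if (pendingGroup != null) {
                    suspects.add(pendingGroup);
                }
                pendingGroup = line.substring(idx + CLOSE_MARKER.length()).trim();
            } else if (line.contains(KILL_MARKER)) {
                pendingGroup = null;
            }
        }
        if (pendingGroup != null) {
            suspects.add(pendingGroup);
        }
        return suspects;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CloseWithoutKillScanner <zeppelin-log-file>");
            return;
        }
        for (String group : findSuspectGroups(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println("closed without kill: " + group);
        }
    }
}
```

      A group that was closed at the very end of the log is also reported, so the last entry may be a false positive if the log was cut off mid-shutdown.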

      When a notebook is triggered after the timeout has occurred, an additional interpreter gets started and the first one stays alive forever.

      Restarting the interpreter does not kill the first process either.

      Only restarting Zeppelin kills all orphaned interpreter processes.
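      To spot such orphans without restarting Zeppelin, the running JVMs on the host can be listed and filtered. This is a rough sketch under stated assumptions: it needs Java 9+ for ProcessHandle, the class InterpreterProcessLister is hypothetical, and it assumes the interpreter process can be recognized by org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer appearing on its command line. It only lists candidates; deciding which one is the orphan (for example by start time) is left to the operator.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical diagnostic (not Zeppelin code): lists processes that look like remote interpreters. */
final class InterpreterProcessLister {
    // Assumption: this main class name appears on the interpreter's command line.
    static final String MAIN_CLASS =
        "org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer";

    /** True if the given command line mentions the interpreter server main class. */
    static boolean looksLikeInterpreter(String commandLine) {
        return commandLine != null && commandLine.contains(MAIN_CLASS);
    }

    /** Returns "pid started=<instant>" for every visible process that matches. */
    static List<String> describeCandidates() {
        return ProcessHandle.allProcesses()
            // commandLine() may be empty if the OS or permissions hide it.
            .filter(p -> looksLikeInterpreter(p.info().commandLine().orElse(null)))
            .map(p -> p.pid() + " started="
                + p.info().startInstant().map(Object::toString).orElse("unknown"))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        describeCandidates().forEach(System.out::println);
    }
}
```

      If more interpreter processes are listed for a setting than Zeppelin currently shows as running, the oldest ones are the likely orphans.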

            People

              Assignee: Jeff Zhang (zjffdu)
              Reporter: Andreas Weise (aweise)