Zeppelin / ZEPPELIN-2719

Can't get Spark interpreter to work with Cloudera's YARN cluster

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.2
    • Fix Version/s: None
    • Component/s: Interpreters
    • Labels: None
    • Environment:
      OS: Ubuntu 14.04.5 LTS
      JRE: 1.7.0_67
      Cloudera CDH 5.9.1
      Hadoop 2.6.0-cdh5.9.1 in HA mode
      Spark 1.6.1 running in a YARN cluster in HA mode
      Scala 2.10
      Kerberos

    Description

      Hi,

      I'm having problems getting the Spark interpreter to work. Every time I try to run it I get a connection refused error:

      java.net.ConnectException: Connection refused
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
              [...]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      

      I've spent a few days trying to debug the issue and I'm at a point where I'm running out of ideas, so any help is greatly appreciated.

      I have built Zeppelin for my environment using:

      mvn clean package -Pspark-1.6 -Dhadoop.version=2.6.0-cdh5.9.1 -Phadoop-2.6 -Pvendor-repo -Pscala-2.10 -Pbuild-distr -DskipTests
      

      And I have the following configuration in zeppelin-env.sh:

      export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera/jre
      export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
      export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
      export HADOOP_CONF_DIR=/etc/hadoop/conf
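
A quick sanity check for the paths exported above can be sketched as follows. The function name `check_dirs` is hypothetical (it is not part of Zeppelin), and the check only confirms that each directory exists on the Zeppelin host. It says nothing about whether the contents are correct.

```shell
#!/bin/sh
# check_dirs is a hypothetical helper, not a Zeppelin command.
# It prints one "MISSING: <path>" line for every argument that is not an
# existing directory, and returns non-zero if at least one is missing.
check_dirs() {
    status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "MISSING: $dir"
            status=1
        fi
    done
    return "$status"
}

# Example call with the variables from zeppelin-env.sh (uncomment to run
# after sourcing that file):
# check_dirs "$JAVA_HOME" "$SPARK_HOME" "$HADOOP_HOME" "$HADOOP_CONF_DIR"
```

If the call prints nothing and returns 0, all four directories exist for the user running the check.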
      

      I read in a different issue that lowering the memory settings could help, so I added:

      export ZEPPELIN_JAVA_OPTS=" -Dspark.executor.memory=1g -Dspark.cores.max=2"
      export ZEPPELIN_MEM=" -Xms512m -Xmx1024m -XX:MaxPermSize=256m"
      export ZEPPELIN_INTP_MEM=" -Xms512m -Xmx1024m -XX:MaxPermSize=256m"
      export SPARK_SUBMIT_OPTIONS=" --driver-memory 512M --executor-memory 1G"
      

      But it doesn't seem to change anything; I still get the same error.

      The Spark interpreter is configured as follows:

      master:	yarn-client
      spark.app.name:	Zeppelin
      spark.yarn.keytab:	/opt/zeppelin/zeppelin.keytab
      spark.yarn.principal:	zeppelin@<REALM>
      zeppelin.dep.additionalRemoteRepository:	spark-packages,http://dl.bintray.com/spark-packages/maven,false;
      zeppelin.dep.localrepo:	local-repo
      zeppelin.pyspark.python:	python
      zeppelin.spark.concurrentSQL:	false
      zeppelin.spark.importImplicit:	true
      zeppelin.spark.maxResult:	1000
      zeppelin.spark.printREPLOutput:	true
      zeppelin.spark.sql.stacktrace:	false
      zeppelin.spark.useHiveContext:	true
      

      The Zeppelin Kerberos principal and keytab should be OK: I'm using them with Livy and that works.

      Here are the relevant lines from zeppelin-zeppelin-<hostname>.log:

       INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:188) - Create interpreter instance spark for note 2CGW3RAGX
       INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.SparkInterpreter 799822533 created
       INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.SparkSqlInterpreter 1517165558 created
       INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.DepInterpreter 1928192475 created
       INFO [2017-07-04 08:12:14,681] ({qtp1527142660-16} InterpreterFactory.java[createInterpretersForNote]:221) - Interpreter org.apache.zeppelin.spark.PySparkInterpreter 1602694095 created
       INFO [2017-07-04 08:12:20,051] ({pool-2-thread-2} SchedulerFactory.java[jobStarted]:131) - Job paragraph_1495010482434_695017792 started by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpretershared_session1222353445
       INFO [2017-07-04 08:12:20,052] ({pool-2-thread-2} Paragraph.java[jobRun]:362) - run paragraph 20170517-084122_2115191800 using spark org.apache.zeppelin.interpreter.LazyOpenInterpreter@2fac52c5
       INFO [2017-07-04 08:12:20,060] ({pool-2-thread-2} RemoteInterpreterManagedProcess.java[start]:126) - Run interpreter process [/opt/zeppelin/zeppelin/bin/interpreter.sh, -d, /opt/zeppelin/zeppelin/interpreter/spark, -p, 52698, -l, /opt/zeppelin/zeppelin/local-repo/2CJKGGV2U]
      ERROR [2017-07-04 08:12:50,124] ({Thread-36} RemoteScheduler.java[getStatus]:256) - Can't get status information
      org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
      	at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
      	at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
      	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
      	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
      	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
      	at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:254)
      	at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.run(RemoteScheduler.java:212)
      Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
      	at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
      	... 8 more
      Caused by: java.net.ConnectException: Connection refused
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
      	... 9 more
      ERROR [2017-07-04 08:12:50,124] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.SparkInterpreter. Remove it from interpreterGroup
      ERROR [2017-07-04 08:12:50,125] ({Thread-35} RemoteInterpreterEventPoller.java[run]:102) - Can't get RemoteInterpreterEvent
      org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
      	at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
      	at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
      	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
      	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
      	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
      	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterEventPoller.run(RemoteInterpreterEventPoller.java:100)
      Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
      	at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
      	... 7 more
      Caused by: java.net.ConnectException: Connection refused
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
      	... 8 more
      ERROR [2017-07-04 08:12:50,125] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.SparkSqlInterpreter. Remove it from interpreterGroup
      ERROR [2017-07-04 08:12:50,125] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.DepInterpreter. Remove it from interpreterGroup
      ERROR [2017-07-04 08:12:50,126] ({pool-2-thread-2} RemoteInterpreter.java[open]:268) - Failed to initialize interpreter: org.apache.zeppelin.spark.PySparkInterpreter. Remove it from interpreterGroup
      ERROR [2017-07-04 08:12:50,126] ({pool-2-thread-2} Job.java[run]:188) - Job failed
      org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
      	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:434)
      	at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:106)
      	at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:387)
      	at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
      	at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
      	at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
      	at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
      	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
      	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
      	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:92)
      	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:432)
      	... 11 more
      Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
      	at org.apache.thrift.transport.TSocket.open(TSocket.java:187)
      	at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
      	... 18 more
      Caused by: java.net.ConnectException: Connection refused
      	at java.net.PlainSocketImpl.socketConnect(Native Method)
      	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
      	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
      	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
      	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      	at java.net.Socket.connect(Socket.java:579)
      	at org.apache.thrift.transport.TSocket.open(TSocket.java:182)
      	... 19 more
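
In the log above, the interpreter process is launched at 08:12:20 with `-p, 52698`, and the first "Connection refused" follows at 08:12:50. A refused connection generally means nothing was accepting connections on that port at that moment, so it can help to pull the port out of the log and check it by hand. The snippet below is a minimal sketch of the first step. The function name `interpreter_ports` is hypothetical, and the pattern assumes the "Run interpreter process [...]" line format shown in this log.

```shell
#!/bin/sh
# interpreter_ports is a hypothetical helper, not a Zeppelin command.
# It reads a Zeppelin server log on stdin and prints the value that follows
# "-p," on every "Run interpreter process [...]" line. Other lines are ignored.
interpreter_ports() {
    sed -n 's/.*Run interpreter process \[.*, -p, \([0-9][0-9]*\),.*/\1/p'
}

# Example call (uncomment and substitute the real log file name):
# interpreter_ports < zeppelin-zeppelin-<hostname>.log
```

Fed the log excerpt above, this prints `52698`, the port the Zeppelin server then fails to connect to.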
      

      No zeppelin-interpreter-spark-zeppelin-<hostname>.log file is being created.
      The only errors I can see in the YARN logs are these:

      log4j:ERROR Could not read configuration file from URL [file:/opt/zeppelin/zeppelin/conf/log4j.properties].
      java.io.FileNotFoundException: /opt/zeppelin/zeppelin/conf/log4j.properties (No such file or directory)
      	at java.io.FileInputStream.open(Native Method)
      	at java.io.FileInputStream.<init>(FileInputStream.java:146)
      	at java.io.FileInputStream.<init>(FileInputStream.java:101)
      	at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
      	at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
      	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
      	at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
      	at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
      	at org.apache.spark.Logging$class.initializeLogging(Logging.scala:121)
      	at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogging(ApplicationMaster.scala:635)
      	at org.apache.spark.Logging$class.initializeLogIfNecessary(Logging.scala:106)
      	at org.apache.spark.deploy.yarn.ApplicationMaster$.initializeLogIfNecessary(ApplicationMaster.scala:635)
      	at org.apache.spark.Logging$class.log(Logging.scala:50)
      	at org.apache.spark.deploy.yarn.ApplicationMaster$.log(ApplicationMaster.scala:635)
      	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:649)
      	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
      log4j:ERROR Ignoring configuration file [file:/opt/zeppelin/zeppelin/conf/log4j.properties].
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      [...]
      17/07/04 08:17:03 ERROR ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors. Failing the application.
      17/07/04 08:17:03 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Timed out waiting for SparkContext.)
      17/07/04 08:17:03 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Timed out waiting for SparkContext.)
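
When the YARN container log is long, the ApplicationMaster's verdict can be extracted with a small filter. This is only a sketch: the function name `final_status` is hypothetical, and the pattern assumes the "Final app status" line format shown above.

```shell
#!/bin/sh
# final_status is a hypothetical helper, not a YARN or Spark command.
# It reads an ApplicationMaster log on stdin and, for each
# "Final app status" line, prints "<status> <exitCode> <reason>".
final_status() {
    sed -n 's/.*ApplicationMaster: Final app status: \([A-Z]*\), exitCode: \([0-9]*\), (reason: \(.*\))$/\1 \2 \3/p'
}

# Example call (uncomment and substitute a saved copy of the container log):
# final_status < applicationmaster.log
```

For the excerpt above this yields `FAILED 13 Timed out waiting for SparkContext.`.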
      

      Thanks!

      People

        Assignee: Unassigned
        Reporter: Miguel (loopbit)
        Votes: 0
        Watchers: 5