Details
Bug
Status: Open
Major
Resolution: Unresolved
0.7.3
None
None
Description
I'm running EMR 5.16.0 on AWS. If I run any Spark SQL queries against my RDBMS using the Scala interpreter, they seem to execute just fine; however, the log file fills with this exception over and over again:
ERROR [2018-08-16 22:04:36,601] ({pool-2-thread-2} SparkInterpreter.java[getProgressFromStage_1_1x]:1503) - Error on getting progress information
java.lang.NoSuchMethodException: org.apache.zeppelin.spark.SparkInterpreter$1.stageIdToData()
    at java.lang.Class.getMethod(Class.java:1786)
    at org.apache.zeppelin.spark.SparkInterpreter.getProgressFromStage_1_1x(SparkInterpreter.java:1487)
    at org.apache.zeppelin.spark.SparkInterpreter.getProgressFromStage_1_1x(SparkInterpreter.java:1510)
    at org.apache.zeppelin.spark.SparkInterpreter.getProgress(SparkInterpreter.java:1430)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:117)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.getProgress(RemoteInterpreterServer.java:555)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1762)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1747)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
This simple code, which queries my own database, will trigger it. I'm not convinced it has anything to do with Spark SQL, though; I suspect it is related to long-running commands instead.
import org.apache.spark.sql._

val dbConnectionMap = Map(
  "url" -> "<redacted>",
  "driver" -> "com.mysql.jdbc.Driver"
)

val sql = """(select item_name from product_catalog) as product_catalog"""

val products = spark.read.format("jdbc").options(dbConnectionMap + ("dbtable" -> sql)).load.cache

products.count
This wouldn't be a big concern, since the execution works, except that after a couple of hours of analyzing data I started getting file system errors. They turned out to be caused by the log file taking up all the hard drive space: 33 GB!
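For illustration, here is a minimal Java sketch of the pattern the stack trace suggests: a reflective lookup of stageIdToData() that fails on every progress poll. It is not Zeppelin's actual code. The ProgressProbe and ListenerWithoutStageData classes, the lookupFailed flag, and the in-memory log list are all hypothetical names made up for this example. The sketch also shows one possible mitigation, recording the failure only once instead of on every poll, under the assumption that the missing method will not appear later in the same session.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ProgressProbe {
    // Hypothetical stand-in for a listener class that has no stageIdToData() method.
    public static class ListenerWithoutStageData {
        public int activeStages() {
            return 0;
        }
    }

    // In-memory stand-in for the log file, so the number of entries can be counted.
    private final List<String> logLines = new ArrayList<>();

    // Set after the first failed lookup so later polls do not log again.
    private boolean lookupFailed = false;

    // Called once per progress poll; returns 0 when the accessor is unavailable.
    public int getProgress(Object listener) {
        if (lookupFailed) {
            return 0; // already reported once, stay quiet
        }
        try {
            // Class.getMethod throws NoSuchMethodException if no public
            // method with this name and parameter list exists.
            Method accessor = listener.getClass().getMethod("stageIdToData");
            return accessor != null ? 1 : 0; // placeholder for a real progress computation
        } catch (NoSuchMethodException e) {
            lookupFailed = true;
            logLines.add("Error on getting progress information: " + e);
            return 0;
        }
    }

    public List<String> getLogLines() {
        return logLines;
    }

    public static void main(String[] args) {
        ProgressProbe probe = new ProgressProbe();
        Object listener = new ListenerWithoutStageData();
        // Simulate a long-running paragraph that is polled many times.
        for (int i = 0; i < 1000; i++) {
            probe.getProgress(listener);
        }
        System.out.println("log lines written: " + probe.getLogLines().size());
    }
}
```

Without the lookupFailed guard, each of the 1000 simulated polls would add its own entry, which matches the repeated exception I'm seeing in the log during long-running commands.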