Details
- Bug
- Status: Resolved
- Critical
- Resolution: Fixed
- 1.5.1
- None
- None
- Hive 1.2, Spark on YARN
Description
This is a regression in Spark 1.5, introduced when the Hive dependency was upgraded to 1.2.
HIVE-2573 introduced a feature that lets users register functions in a session. The problem is that it added a static code block to Hive.java:
    // register all permanent functions. need improvement
    static {
      try {
        reloadFunctions();
      } catch (Exception e) {
        LOG.warn("Failed to access metastore. This class should not accessed in runtime.",e);
      }
    }
This code block is executed by every Spark executor in the cluster when HadoopRDD tries to access the JobConf. So if a Spark job has high parallelism (e.g. 1000+), the executors will hammer the HCat server, in the worst case causing it to go down.
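To illustrate why the static block fires on every executor, here is a minimal sketch of how Java static initializers behave. The classes FakeMetastore, EagerRegistry and StaticInitDemo are hypothetical stand-ins, not Hive or Spark code; the point is only that the first use of any member of the class triggers the static block, even when the caller never needs the metastore.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a remote metastore; it only counts calls.
final class FakeMetastore {
    static final AtomicInteger calls = new AtomicInteger();

    static void getAllDatabases() {
        calls.incrementAndGet();
    }
}

// Hypothetical stand-in for a class with a static block like the one in Hive.java.
final class EagerRegistry {
    static {
        // Runs once per JVM, the first time the class is initialized,
        // no matter which static member triggered the initialization.
        FakeMetastore.getAllDatabases();
    }

    // An unrelated helper that never needs the metastore.
    static String configureJobProperties(String table) {
        return "configured:" + table;
    }
}

final class StaticInitDemo {
    static int callsAfterUnrelatedUse() {
        EagerRegistry.configureJobProperties("t1");
        EagerRegistry.configureJobProperties("t2");
        return FakeMetastore.calls.get();
    }

    public static void main(String[] args) {
        System.out.println("metastore calls: " + callsAfterUnrelatedUse());
    }
}
```

In this sketch the counter ends at one call per JVM. Each Spark executor is its own JVM, so the cost is paid once per executor rather than once per job.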
Here is the stack trace I captured in an executor as it connects to the Hive metastore:
15/10/06 19:26:05 WARN conf.HiveConf: HiveConf of name hive.optimize.s3.query does not exist
15/10/06 19:26:05 INFO hive.metastore: XXX: java.lang.Thread.getStackTrace(Thread.java:1589)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
15/10/06 19:26:05 INFO hive.metastore: XXX: sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
15/10/06 19:26:05 INFO hive.metastore: XXX: sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
15/10/06 19:26:05 INFO hive.metastore: XXX: sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
15/10/06 19:26:05 INFO hive.metastore: XXX: java.lang.reflect.Constructor.newInstance(Constructor.java:526)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:803)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:782)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:347)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.sql.hive.HadoopTableReader$anonfun$17.apply(TableReader.scala:322)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.sql.hive.HadoopTableReader$anonfun$17.apply(TableReader.scala:322)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD$anonfun$getJobConf$6.apply(HadoopRDD.scala:179)
15/10/06 19:26:05 INFO hive.metastore: XXX: scala.Option.map(Option.scala:145)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:179)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD$anon$1.<init>(HadoopRDD.scala:231)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:227)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:103)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:97)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.scheduler.Task.run(Task.scala:88)
15/10/06 19:26:05 INFO hive.metastore: XXX: org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
15/10/06 19:26:05 INFO hive.metastore: XXX: java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
15/10/06 19:26:05 INFO hive.metastore: XXX: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
15/10/06 19:26:05 INFO hive.metastore: XXX: java.lang.Thread.run(Thread.java:745)
15/10/06 19:26:05 INFO hive.metastore: Trying to connect to metastore with URI thrift://admin.gateway.dataeng.netflix.net:11002
As the trace shows, HadoopRDD fetches the JobConf in the executor, which triggers the static initializer of Hive.java and, through it, reloadFunctions().
Worse, due to HIVE-10319, a single reloadFunctions() call ends up making hundreds of Thrift calls to the Hive metastore when the metastore contains a large number of databases. As a result, any Spark job can easily take down the HCat server in production.
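As a rough back-of-the-envelope sketch of the load involved: the executor count below follows the "1000+" figure mentioned earlier, while the 300 calls per reload is an assumed value standing in for "hundreds", not a measurement. MetastoreFanOut is a hypothetical helper written for this illustration.

```java
// Hypothetical helper: estimates metastore load when every executor JVM
// runs the static block once.
final class MetastoreFanOut {
    static long totalCalls(int executors, int callsPerReload) {
        // Widen to long before multiplying to avoid int overflow.
        return (long) executors * callsPerReload;
    }

    public static void main(String[] args) {
        int executors = 1000;      // "1000+" parallelism from the report
        int callsPerReload = 300;  // assumed value for "hundreds" of Thrift calls
        System.out.println(totalCalls(executors, callsPerReload)); // prints 300000
    }
}
```

Under these assumptions a single job startup issues on the order of 300,000 Thrift calls, most of them arriving within a short window as the executors begin their first tasks.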
As a workaround, I forked Databricks' Hive 1.2 repo, removed the static code block from Hive.java, and rebuilt Spark against this forked version of Hive. I don't know whether there is a better way to fix this problem.
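One possible alternative to deleting the block outright, shown here only as a hypothetical sketch and not as the fix applied in Hive or Spark, is to defer the work until a caller explicitly asks for it. The classes CountingMetastore, LazyRegistry and LazyInitDemo are invented for illustration; the sketch uses the standard initialization-on-demand holder idiom.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a remote metastore; it only counts calls.
final class CountingMetastore {
    static final AtomicInteger calls = new AtomicInteger();

    static void getAllDatabases() {
        calls.incrementAndGet();
    }
}

// Hypothetical registry that loads functions on demand instead of at class load.
final class LazyRegistry {
    // The nested class is initialized only when Holder.LOADED is first read,
    // and the JVM guarantees that this happens at most once.
    private static final class Holder {
        static final boolean LOADED = load();

        private static boolean load() {
            CountingMetastore.getAllDatabases();
            return true;
        }
    }

    // Does not touch Holder, so it never contacts the metastore.
    static String configureJobProperties(String table) {
        return "configured:" + table;
    }

    // Callers that actually need the functions opt in explicitly.
    static boolean ensureFunctionsLoaded() {
        return Holder.LOADED;
    }
}

final class LazyInitDemo {
    public static void main(String[] args) {
        LazyRegistry.configureJobProperties("t1");
        System.out.println("after configure: " + CountingMetastore.calls.get());
        LazyRegistry.ensureFunctionsLoaded();
        LazyRegistry.ensureFunctionsLoaded();
        System.out.println("after ensure: " + CountingMetastore.calls.get());
    }
}
```

With this shape, code paths such as job configuration on executors would not contact the metastore at all, while sessions that need registered functions would pay the cost once. Whether this is workable in Hive itself depends on callers that rely on the functions being loaded implicitly, which this sketch does not address.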
Issue Links
- is related to: SPARK-10679 javax.jdo.JDOFatalUserException in executor (Resolved)
- relates to: SPARK-8064 Upgrade Hive to 1.2 (Resolved)