Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Description
Following the wiki, I ran the same queries with Hive on Spark 1.6 (HOS16) and Hive on Spark 2.0 (HOS20) in YARN mode.
The table below shows the difference in query time between HOS16 and HOS20 (all times in seconds).
Version | Total time (s) | Time for jobs (s) | Time for preparing jobs (s)
---|---|---|---
Spark 1.6 | 51 | 39 | 12
Spark 2.0 | 54 | 40 | 14
HOS20 spends 2 seconds more on preparing jobs than HOS16. After reviewing the Spark source code, I found the cause in Client#distribute:

In Spark 1.6, the client searches for spark-assembly*.jar and uploads that single jar to the distributed cache.

In Spark 2.0, if neither "spark.yarn.archive" nor "spark.yarn.jars" is set in the Spark configuration, the client first copies every jar in $SPARK_HOME/jars to a temporary directory and then uploads that directory to the distributed cache. This extra copy accounts for the 2 seconds.
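The fallback order described above can be sketched as follows. This is a minimal Python illustration of the logic, not Spark's actual Scala API; the function and parameter names are invented for clarity.

```python
import os
import shutil
import tempfile

def resolve_spark_jars(conf, spark_home):
    """Illustrative sketch of the upload decision in Spark 2.0's
    Client#distribute (simplified; not Spark's real code)."""
    if "spark.yarn.archive" in conf:
        # One pre-built archive: a single file is uploaded, no copying.
        return [conf["spark.yarn.archive"]]
    if "spark.yarn.jars" in conf:
        # Explicit jar list: exactly what is configured is uploaded.
        return conf["spark.yarn.jars"].split(",")
    # Fallback: copy every jar under $SPARK_HOME/jars to a temp
    # directory before uploading -- this copy is the measured overhead.
    tmp = tempfile.mkdtemp(prefix="spark-jars-")
    jars_dir = os.path.join(spark_home, "jars")
    for name in os.listdir(jars_dir):
        if name.endswith(".jar"):
            shutil.copy(os.path.join(jars_dir, name), tmp)
    return [os.path.join(tmp, n) for n in sorted(os.listdir(tmp))]
```

Setting either property short-circuits the function before the copy loop, which is exactly why the workaround below helps.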
We can speed up the startup of Hive on Spark 2.0 by setting "spark.yarn.archive" or "spark.yarn.jars":
Set "spark.yarn.archive":

```
$ cd $SPARK_HOME/jars
$ zip spark-archive.zip ./*.jar   # important: enter the jars folder, then zip
$ hadoop fs -copyFromLocal spark-archive.zip
$ echo "spark.yarn.archive=hdfs:///xxx:8020/spark-archive.zip" >> conf/spark-defaults.conf
```
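The "enter the jars folder then zip" note matters because the jars must sit at the root of the archive; zipping from the parent directory gives every entry a "jars/" prefix. A quick way to sanity-check an archive before uploading it (the helper name is my own, for illustration):

```python
import zipfile

def jars_at_top_level(archive_path):
    """Return True if every .jar entry sits at the archive root.
    Entries like 'jars/a.jar' mean the zip was created from the
    parent directory, which defeats the workaround."""
    with zipfile.ZipFile(archive_path) as zf:
        return all("/" not in name
                   for name in zf.namelist()
                   if name.endswith(".jar"))
```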
Set "spark.yarn.jars":

```
$ hadoop fs -mkdir spark-2.0.0-bin-hadoop
$ hadoop fs -copyFromLocal $SPARK_HOME/jars/* spark-2.0.0-bin-hadoop
$ echo "spark.yarn.jars=hdfs:///xxx:8020/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf
```
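Either property is sufficient; Spark only falls back to the slow copy-and-upload path when both are absent. A small sketch that checks a spark-defaults.conf body (it parses the key=value form used in the commands above; the function name is illustrative):

```python
def yarn_jars_configured(conf_text):
    """Return True if spark-defaults.conf sets either property,
    i.e. the Spark 2.0 client will skip the slow fallback."""
    props = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return "spark.yarn.archive" in props or "spark.yarn.jars" in props
```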
I suggest adding this part to the wiki.
The attachment performance.improvement.after.set.spark.yarn.archive.PNG shows the detailed performance improvement for small queries after setting spark.yarn.archive.