Hive / HIVE-15313

Add export spark.yarn.archive or spark.yarn.jars variable in Hive on Spark document


Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • 2.3.0
    • None
    • None

    Description

      Following the wiki, I ran queries with HOS16 and HOS20 (Hive on Spark 1.6 and 2.0) in yarn mode.
      The following table shows the difference in query time between HOS16 and HOS20.

      Version   Total time (s)   Time for jobs (s)   Time for preparing jobs (s)
      Spark16   51               39                  12
      Spark20   54               40                  14

      HOS20 spends 2 seconds more than HOS16 on preparing jobs. After reviewing the Spark source code, I found that the following causes this:
      In Client#distribute, if Spark 2.0 cannot find spark.yarn.archive or spark.yarn.jars in the Spark configuration file, it first copies all jars in $SPARK_HOME/jars to a tmp directory and then uploads that tmp directory to the distributed cache.
      By comparison, Spark 1.6 searches for spark-assembly*.jar and uploads it to the distributed cache.

      So Spark 2.0 spends 2 more seconds copying all jars in $SPARK_HOME/jars to a tmp directory if we don't set "spark.yarn.archive" or "spark.yarn.jars".

      We can accelerate the startup of Hive on Spark 2.0 by setting "spark.yarn.archive" or "spark.yarn.jars".
      To set "spark.yarn.archive":

      $ cd $SPARK_HOME/jars
      $ zip spark-archive.zip ./*.jar # this is important: enter the jars folder, then zip
      $ hadoop fs -copyFromLocal spark-archive.zip /spark-archive.zip
      $ echo "spark.yarn.archive=hdfs://xxx:8020/spark-archive.zip" >> $SPARK_HOME/conf/spark-defaults.conf
      

      To set "spark.yarn.jars":

      $ hadoop fs -mkdir /spark-2.0.0-bin-hadoop
      $ hadoop fs -copyFromLocal $SPARK_HOME/jars/* /spark-2.0.0-bin-hadoop
      $ echo "spark.yarn.jars=hdfs://xxx:8020/spark-2.0.0-bin-hadoop/*" >> $SPARK_HOME/conf/spark-defaults.conf
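
Before timing queries, it may help to confirm that one of the two properties actually ended up in the configuration file. The following is only a minimal sketch: the function name has_spark_yarn_libs is hypothetical, and it assumes the properties are written one per line in the key=value or "key value" form used above.

```shell
# Hypothetical helper (not part of Hive or Spark): succeeds if the given
# spark-defaults.conf sets spark.yarn.archive or spark.yarn.jars on a
# non-comment line, in either "key=value" or "key value" form.
has_spark_yarn_libs() {
  grep -Eq '^[[:space:]]*spark\.yarn\.(archive|jars)[[:space:]=:]' "$1"
}
```

For example, `has_spark_yarn_libs "$SPARK_HOME/conf/spark-defaults.conf" || echo "neither property is set"` prints a warning when neither property is configured, which is the case in which Spark 2.0 falls back to copying $SPARK_HOME/jars.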
      

      I suggest adding this part to the wiki.

      performance.improvement.after.set.spark.yarn.archive.PNG shows the detailed performance improvement for small queries after setting spark.yarn.archive.

          People

            kellyzly liyunzhang