Spark / SPARK-20187

Replace loadTable with moveFile to speed up load table for many output files

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      HiveClientImpl.loadTable loads files one by one, so this step takes a long time when a job generates many output files. The Hive.moveFile API can speed up this step for create table tableName as select ... and insert overwrite table tableName select ...
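
      The difference between the two strategies can be sketched on a local filesystem. This is only an analogy, not Hive's or Spark's implementation: the function names (load_file_by_file, load_by_directory_rename, make_staging, demo) are hypothetical, and a local rename stands in for the move on the cluster filesystem. The point is that the per-file approach issues one move per output file, while the directory approach issues a single rename regardless of the file count.

```python
import os
import shutil
import tempfile


def load_file_by_file(staging_dir, table_dir):
    """Move every staged file individually; return the number of moves."""
    os.makedirs(table_dir, exist_ok=True)
    moves = 0
    for name in sorted(os.listdir(staging_dir)):
        shutil.move(os.path.join(staging_dir, name),
                    os.path.join(table_dir, name))
        moves += 1
    return moves


def load_by_directory_rename(staging_dir, table_dir):
    """Replace the table directory with the staging directory in one rename."""
    if os.path.exists(table_dir):
        shutil.rmtree(table_dir)  # overwrite semantics (assumption)
    os.rename(staging_dir, table_dir)
    return 1


def make_staging(root, name, n_files):
    """Create a staging directory holding n_files small part files."""
    staging = os.path.join(root, name)
    os.makedirs(staging)
    for i in range(n_files):
        with open(os.path.join(staging, "part-%05d" % i), "w") as f:
            f.write("row\n")
    return staging


def demo(n_files=100):
    """Run both strategies and report (per-file ops, rename ops, file counts)."""
    with tempfile.TemporaryDirectory() as root:
        staging_a = make_staging(root, "staging_a", n_files)
        staging_b = make_staging(root, "staging_b", n_files)
        table_a = os.path.join(root, "table_a")
        table_b = os.path.join(root, "table_b")
        per_file_ops = load_file_by_file(staging_a, table_a)
        rename_ops = load_by_directory_rename(staging_b, table_b)
        return (per_file_ops, rename_ops,
                len(os.listdir(table_a)), len(os.listdir(table_b)))


if __name__ == "__main__":
    print(demo(100))
```

      Both strategies leave the same files in the target directory; only the number of filesystem operations differs, which is why the gap grows with the number of output files.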

      Here is a comparison of the two APIs:

      loadTable API: loading the table took about 26 minutes (10:50:14 - 11:16:18)
      17/04/01 10:50:04 INFO TaskSetManager: Finished task 207165.0 in stage 0.0 (TID 216796) in 5952 ms on jqhadoop-test28-8.int.yihaodian.com (executor 54) (216869/216869)
      17/04/01 10:50:04 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
      17/04/01 10:50:04 INFO DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376) finished in 541.797 s
      17/04/01 10:50:04 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376, took 551.208919 s
      17/04/01 10:50:04 INFO FileFormatWriter: Job null committed.
      17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00000-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
      17/04/01 10:50:14 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-00001-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
      
      ...
      
      17/04/01 11:16:11 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_10-40-02_349_8047899863313770218-1/-ext-10000/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow/part-99999-9335c5f3-60fa-418b-a466-2d76a5e84537-c000, Status:true
      17/04/01 11:16:18 INFO SparkSqlParser: Parsing command: `tmp`.`spark_load_slow`
      17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
      17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
      17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
      17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
      17/04/01 11:16:18 INFO CatalystSqlParser: Parsing command: string
      Time taken: 2178.736 seconds
      17/04/01 11:16:18 INFO CliDriver: Time taken: 2178.736 seconds
      
      moveFile API: loading the table took about 9 minutes (13:24:39 - 13:33:46)
      17/04/01 13:24:38 INFO TaskSetManager: Finished task 210610.0 in stage 0.0 (TID 216829) in 5888 ms on jqhadoop-test28-28.int.yihaodian.com (executor 59) (216869/216869)
      17/04/01 13:24:38 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
      17/04/01 13:24:38 INFO DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376) finished in 532.409 s
      17/04/01 13:24:38 INFO DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376, took 539.337610 s
      17/04/01 13:24:39 INFO FileFormatWriter: Job null committed.
      17/04/01 13:24:39 INFO Hive: Replacing src:viewfs://cluster4/user/hive/warehouse/staging/.hive-staging_hive_2017-04-01_13-14-46_099_8962745596360417817-1/-ext-10000, dest: viewfs://cluster4/user/hive/warehouse/tmp.db/spark_load_slow_movefile, Status:true
      17/04/01 13:33:46 INFO SparkSqlParser: Parsing command: `tmp`.`spark_load_slow_movefile`
      17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
      17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
      17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
      17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
      17/04/01 13:33:46 INFO CatalystSqlParser: Parsing command: string
      Time taken: 1142.671 seconds
      17/04/01 13:33:46 INFO CliDriver: Time taken: 1142.671 seconds
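
      As a cross-check of the figures quoted above, the load-phase durations can be recomputed from the log timestamps. This is a small sketch: the helper name elapsed_seconds is hypothetical, and the timestamp format is assumed to be yy/MM/dd HH:mm:ss as it appears in the log lines.

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt="%y/%m/%d %H:%M:%S"):
    """Seconds between two log timestamps (format is an assumption)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# Load phase: first "Replacing src" line to the first line after loading.
load_table_s = elapsed_seconds("17/04/01 10:50:14", "17/04/01 11:16:18")
move_file_s = elapsed_seconds("17/04/01 13:24:39", "17/04/01 13:33:46")

# End-to-end times as reported by CliDriver in the two runs.
total_load_table_s = 2178.736
total_move_file_s = 1142.671

print("loadTable load phase: %d s" % load_table_s)   # 1564 s (26 min 4 s)
print("moveFile load phase:  %d s" % move_file_s)    # 547 s (9 min 7 s)
print("load phase ratio: %.2f" % (load_table_s / move_file_s))
print("end-to-end ratio: %.2f" % (total_load_table_s / total_move_file_s))
```

      The load phase shrinks by a larger factor than the end-to-end time, because the query stage itself (about 540 s in both runs) is unaffected by the change.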
      

      More logs can be found in the attachments.

      Attachments

        1. spark.moveFile.log.tar.gz
          5.43 MB
          Yuming Wang
        2. spark.loadTable.log.tar.gz
          6.64 MB
          Yuming Wang

            People

              Assignee: Unassigned
              Reporter: Yuming Wang (yumwang)
              Votes: 0
              Watchers: 2