Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-18077

Run insert overwrite statements in spark to overwrite a partitioned table is very slow

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Won't Fix
    • 2.0.0
    • None
    • SQL
    • spark 2.0
      hive 2.0.1
      driver memory: 4g
      total executors: 4
      executor memory: 10g
      total cores: 13

    • Important

    Description

      Hello,all. I face a strange thing in my project.

      there is a table:

      CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` string, `recdate` string)
      PARTITIONED BY (`pt` string, `dt` string) stored as orc;

      another table:
      CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` string, `happened_time` int)
      PARTITIONED BY (`pt` string, `dt` string)

      ----------------------------------------------------------------------------------------------
      Test-1:

      executed sql in spark-shell or spark-sql( before i run this sql, there is much data in partition(pt='mix_en', dt='2016-10-21') of table login4game ):

      insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login where pt='mix_en' and dt='2016-10-21'

      it will cost a lot of time, below is a part of the logs:

      /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
      [Stage 5:=======================================> (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
      [Stage 5:=========================================> (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
      [Stage 5:============================================> (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs]
      [Stage 5:================================================> (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
      [Stage 5:================================================> (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
      [Stage 5:====================================================> (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs]
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      .......
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      /////////////////////////////////////////////////////////////////////////////////////////////////////////////////

      i can see, the origin data is moved to .trash

      and then, there is no log printing, and after about 10 min, the log print again:

      /////////////////////////////////////////////////////////////////////////////////////////////////////////////////
      16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00000, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000, Status:true
      16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00001, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001, Status:true
      16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00002, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002, Status:true
      16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00003, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003, Status:true
      16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00004, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004, Status:true
      ....
      16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-10000/part-00199, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199, Status:true
      /////////////////////////////////////////////////////////////////////////////////////////////////////////////////

      after which, the sql executing is end. but it costs over 10 min to handle 257128 lines of data

      ---------------------------------------------------------------------------------------------
      Test-2:

      if i remove all data of partition(pt='mix_en',dt='2016-10-21') in table login4game, and run such sql:

      ALTER TABLE login4game DROP IF EXISTS PARTITION (pt='mix_en', dt='2016-10-21');

      insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login where pt='mix_en' and dt='2016-10-21'

      the logs is the same:

      ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
      [Stage 5:=======================================> (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs]
      [Stage 5:=========================================> (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
      [Stage 5:============================================> (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs]
      [Stage 5:================================================> (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
      [Stage 5:================================================> (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs]
      [Stage 5:====================================================> (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs]
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00000' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00001' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00002' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00003' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00004' to trash at: hdfs://master.com/user/hadoop/.Trash/Current
      .......
      Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current

      /////////////////////////////////////////////////////////////////////////////////////////////////////////////////

      but after moving to trash, i don't need to wait for 10min, and the next log appears immediately.

      for about 20 seconds, this sql is finished, and i get the same result as the above executing(the drop partitions statement costs about 5 second).

      ----------------------------------------------------------------------------------------------
      Test-3:

      now, if i run the same sql in hive server(not using spark-sql or spark-shell to run, just run it in beeline or hive client):

      before running, i also load the same data in to login4game in partition(pt='mix_en',dt='2016-10-21')

      then i run such sql:

      insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login where pt='mix_en' and dt='2016-10-21'

      after running, i get the same result for about 20 seconds.

      ==========================================================

      now, it's the question, why its so slow when i use spark-sql to execute insert overwrite statement to handle a partitioned table with data(if there is no data in selected partitions, it runs fast) ?

      important point:

      1. it costs a lot of time in running move to trash when there is much data
      2. running the same sql in beeline is fast
      3. drop data from selected partitions before running executing insert overwrite statement will speeding the execution

      Is there anyone could help me? Thanks a lot!

      Attachments

        Activity

          People

            Unassigned Unassigned
            snodawn snodawn
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 120h
                120h
                Remaining:
                Remaining Estimate - 120h
                120h
                Logged:
                Time Spent - Not Specified
                Not Specified