  Apache Tez / TEZ-4442

Tez unable to control memory usage when a UDF occupies 100 MB of memory


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: 0.9.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: CDP 7.1.7 SP1, Tez 0.9.1, Hive 3.1.3

       

    Description

                We have a UDF which loads about 5 million records into memory, matches the in-memory data against the user's input, and finally returns the output. Each input record to the UDF produces one output record.

                Based on heap-dump analysis, this UDF occupies about 100 MB of memory. The UDF runs stably on Hive on MR, Hive on Spark, and native Spark, and only needs about 4 GB of memory in those cases. However, with the Tez engine the task fails even after we increase the memory from 4 GB to 8 GB, and it still fails with high probability at 12 GB. Why does the Tez engine need so much more memory than MR and Spark? Is there a good tuning method to control the amount of memory?

       

       

      The command is as follows:
      beeline -u 'jdbc:hive2://bg21146.hadoop.com:10000/default;principal=hive/bg21146.hadoop.com@BG.COM' --hiveconf tez.queue.name=root.000kjb.bdhmgmas_bas -e "
       
      create temporary function get_card_rank as 'com.unionpay.spark.udf.GenericUDFCupsCardMediaProc' using jar 'hdfs:///user/lib/spark-udf-0.0.1-SNAPSHOT.jar';
       
      set tez.am.log.level=debug;
      set tez.am.resource.memory.mb=8192;
      set hive.tez.container.size=8192;
      set tez.task.resource.memory.mb=2048;
      set tez.runtime.io.sort.mb=1200;
      set hive.auto.convert.join.noconditionaltask.size=500000000;
      set tez.runtime.unordered.output.buffer.size-mb=800;
      set tez.grouping.min-size=33554432;
      set tez.grouping.max-size=536870912;
      set hive.tez.auto.reducer.parallelism=true;
      set hive.tez.min.partition.factor=0.25;
      set hive.tez.max.partition.factor=2.0;
      set hive.exec.reducers.bytes.per.reducer=268435456;
      set mapreduce.map.memory.mb=4096;
      set ipc.maximum.response.length=1536000000;
       
       
      select
       get_card_rank(ext_pri_acct_no) as ext_card_media_proc_md,
       count(*)
      from bs_comdb.tmp_bscom_glhis_ct_settle_dtl_bas_swt a
      where a.hp_settle_dt = '20200910'
      group by get_card_rank(ext_pri_acct_no)
      ;
      "

       

      Attachments

        1. java heap2.png
          202 kB
          Authur Wang
        2. java heap1.png
          171 kB
          Authur Wang
        3. hiveserver2.out
          89 kB
          Authur Wang
        4. application_1659706606596_0047.log.gz
          4.48 MB
          Authur Wang
        5. app.log
          90 kB
          Authur Wang
        6. spark-udf-0.0.1-SNAPSHOT.jar
          222 kB
          Authur Wang
        7. spark-udf-src.zip
          127 kB
          Authur Wang

          People

            Assignee: Unassigned
            Reporter: Authur Wang
            Votes: 0
            Watchers: 2
