Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-20304

When hive.optimize.skewjoin and hive.auto.convert.join are both set to true, and the execution engine is mr, same stage may launch twice due to the wrong generated plan

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Duplicate
    • 1.2.1, 2.3.3
    • 4.0.0-alpha-2
    • CLI
    • None

    Description

      `When hive.optimize.skewjoin and hive.auto.convert.join are both set to true, and the execution engine is set to mr, same stage of a query may launch twice due to the wrong generated plan. If hive.exec.parallel is also true, the same stage will launch at the same time and the job will failed due to the first completed stage clear the map.xml/reduce.xml file stored in the hdfs.

      use following sql to reproduce the issue:

      CREATE TABLE `tbl1`(
        `fence` string);
      
      CREATE TABLE `tbl2`(
        `order_id` string,
        `phone` string,
        `search_id` string
      )
      PARTITIONED BY (
        `dt` string);
      
      
      CREATE TABLE `tbl3`(
        `order_id` string,
        `platform` string)
      PARTITIONED BY (
        `dt` string);
      
      
      CREATE TABLE `tbl4`(
        `groupname` string,
        `phone` string)
      PARTITIONED BY (
        `dt` string);
      
      
      CREATE TABLE `tbl5`(
        `search_id` string,
        `fence` string)
      PARTITIONED BY (
        `dt` string);
      
      SET hive.exec.parallel = TRUE;
      
      SET hive.auto.convert.join = TRUE;
      
      SET hive.optimize.skewjoin = TRUE;
      
      
      SELECT dt,
                     platform,
                     groupname,
                     count(1) as cnt
              FROM
              (SELECT dt,
                      platform,
                      groupname
               FROM
               (SELECT fence
                FROM tbl1)ta
                 JOIN
                 (SELECT a0.dt,
                         a1.platform,
                         a2.groupname,
                         a3.fence
                  FROM
                  (SELECT dt,
                          order_id,
                          phone,
                          search_id
                   FROM tbl2
                   WHERE dt =20180703 )a0
                    JOIN
                    (SELECT order_id,
                            platform,
                            dt
                     FROM tbl3
                     WHERE dt =20180703 )a1 ON a0.order_id = a1.order_id
                    INNER JOIN
                    (SELECT groupname,
                            phone,
                            dt
                     FROM tbl4
                     WHERE dt =20180703 )a2 ON a0.phone = a2.phone
                    LEFT JOIN
                    (SELECT search_id,
                            fence,
                            dt
                     FROM tbl5
                     WHERE dt =20180703)a3 ON a0.search_id = a3.search_id)t0 ON ta.fence = t0.fence)t11
              GROUP BY dt,
                       platform,
                       groupname;
      
      DROP TABLE tbl1;
      DROP TABLE tbl2;
      DROP TABLE tbl3;
      DROP TABLE tbl4;
      DROP TABLE tbl5;
      
      

      We will get some error message like this:

      Examining task ID: task_1531284442065_3637_m_000000 (and more) from job job_1531284442065_3637

      Task with the most failures(4):


      Task ID:
      task_1531284442065_3637_m_000000

      URL:
      http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1531284442065_3637&tipid=task_1531284442065_3637_m_000000


      Diagnostic Messages for this Task:
      File does not exist: hdfs://test/tmp/hive-hadoop/hadoop/fe5efa94-abb1-420f-b6ba-ec782e7b79ad/hive_2018-08-03_17-00-17_707_592882314975289971-5/-mr-10045/757eb1f7-7a37-4a7e-abc0-4a3b8b06510c/reduce.xml
      java.io.FileNotFoundException: File does not exist: hdfs://test/tmp/hive-hadoop/hadoop/fe5efa94-abb1-420f-b6ba-ec782e7b79ad/hive_2018-08-03_17-00-17_707_592882314975289971-5/-mr-10045/757eb1f7-7a37-4a7e-abc0-4a3b8b06510c/reduce.xml

      Looking into the plan by executing explain, I found that the Stage-4 and Stage-5 can reached from multi root tasks.

      Explain
      STAGE DEPENDENCIES:
        Stage-21 is a root stage , consists of Stage-34, Stage-5
        Stage-34 has a backup stage: Stage-5
        Stage-20 depends on stages: Stage-34
        Stage-17 depends on stages: Stage-5, Stage-18, Stage-20 , consists of Stage-32, Stage-33, Stage-1
        Stage-32 has a backup stage: Stage-1
        Stage-15 depends on stages: Stage-32
        Stage-10 depends on stages: Stage-1, Stage-15, Stage-16 , consists of Stage-31, Stage-2
        Stage-31
        Stage-9 depends on stages: Stage-31
        Stage-2 depends on stages: Stage-9
        Stage-33 has a backup stage: Stage-1
        Stage-16 depends on stages: Stage-33
        Stage-1
        Stage-5
        Stage-27 is a root stage , consists of Stage-37, Stage-38, Stage-4
        Stage-37 has a backup stage: Stage-4
        Stage-25 depends on stages: Stage-37
        Stage-12 depends on stages: Stage-4, Stage-22, Stage-23, Stage-25, Stage-26 , consists of Stage-36, Stage-5
        Stage-36
        Stage-11 depends on stages: Stage-36
        Stage-19 depends on stages: Stage-11 , consists of Stage-35, Stage-5
        Stage-35 has a backup stage: Stage-5
        Stage-18 depends on stages: Stage-35
        Stage-38 has a backup stage: Stage-4
        Stage-26 depends on stages: Stage-38
        Stage-4
        Stage-30 is a root stage , consists of Stage-42, Stage-43, Stage-3
        Stage-42 has a backup stage: Stage-3
        Stage-28 depends on stages: Stage-42
        Stage-14 depends on stages: Stage-3, Stage-28, Stage-29 , consists of Stage-41, Stage-4
        Stage-41
        Stage-13 depends on stages: Stage-41
        Stage-24 depends on stages: Stage-13 , consists of Stage-39, Stage-40, Stage-4
        Stage-39 has a backup stage: Stage-4
        Stage-22 depends on stages: Stage-39
        Stage-40 has a backup stage: Stage-4
        Stage-23 depends on stages: Stage-40
        Stage-43 has a backup stage: Stage-3
        Stage-29 depends on stages: Stage-43
        Stage-3
        Stage-0 depends on stages: Stage-2
      

      After skewjoin optimization, the processed node is added into the listTasks of ConditionalTask and the parentTask of the processed node is removed and during the commonJoin optimization of listTasks of ConditionalTask, the new generated condTask will be added into root task list due to parentTask is null.

      workaround: do not set hive.optimize.skewjoin and hive.auto.convert.join to true at the same time.

      Attachments

        1. HIVE-20304.patch
          1 kB
          Hui Huang
        2. HIVE-20304.1.patch
          59 kB
          Hui Huang
        3. HIVE-20304.2.patch
          40 kB
          Yongzhi Chen

        Issue Links

          Activity

            People

              BIGrey Hui Huang
              BIGrey Hui Huang
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: