Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-27226

FullOuterJoin with filter expressions is not computed correctly

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 4.0.0-beta-1
    • None
    • Hive

    Description

      I tested many OuterJoin queries as an extension of HIVE-27138, and I found that Hive returns incorrect result for a query containing FullOuterJoin with filter expressions. In a nutshell, all JoinOperators that run on Tez engine return incorrect result for OuterJoin queries, and one of the reason for incorrect computation comes from CommonJoinOperator, which is the base of all JoinOperators. I attached the queries and configuration that I used at the bottom of the document. I am still inspecting this problems, and I will share an update once when I find out another reason. Also any comments and opinions would be appreciated.

      First of all, I observed that current Hive ignores filter expressions contained in MapJoinOperator. For example, the attached result of query1 shows that MapJoinOperator performs inner join, not full outer join. This problem stems from removal of filterMap. When converting JoinOperator to MapJoinOperator, ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() removes filterMap of MapJoinOperator. Because MapJoinOperator does not evaluate filter expressions if filterMap is null, this change makes MapJoinOperator ignore filter expressions and it always joins tables regardless whether they satisfy filter expressions or not. To solve this problem, I disable FullOuterMapJoinOptimization and apply path for HIVE-27138, which prevents NPE. (The patch is available at the following link: LINK.) The rest of this document uses this modified Hive, but most of problems happen to current Hive, too.

      The second problem I found is that Hive returns the same left-null or right-null rows multiple time when it uses MapJoinOperator or CommonMergeJoinOperator. This is caused by the logic of current CommonJoinOperator. Both of the two JoinOperators joins tables in 2 steps. First, they create RowContainers, each of which is a group of rows from one table and has the same key. Second, they call CommonJoinOperator#checkAndGenObject() with created RowContainers. This method checks filterTag of each row in RowContainers and forwards joined row if they meet all filter conditions. For OuterJoin, checkAndGenObject() forwards non-matching rows if there is no matching row in RowContainer. The problem happens when there are multiple RowContainer for the same key and table. For example, suppose that there are two left RowContainers and one right RowContainer. If none of the row in two left RowContainers satisfies filter condition, then checkAndGenObject() will forward Left-Null row for each right row. Because checkAndGenObject() is called with each left RowContainer, there will be two duplicated Left-Null rows for every right row.

      In the case of MapJoinOperator, it always creates singleton RowContainer for big table. Therefore, it always produces duplicated non-matching rows. CommonMergeJoinOperator also creates multiple RowContainer for big table, whose size is hive.join.emit.interval. In the below experiment, I also set hive.join.shortcut.unmatched.rows=false, and hive.exec.reducers.max=1 to disable specialized algorithm for OuterJoin of 2 tables and force calling checkAndGenObject() before all rows with the same keys are gathered. I didn't observe this problem when using VectorMapJoinOperator, and I will inspect VectorMapJoinOperator whether we can reproduce the problem with it.

      I think the second problem is not limited to FullOuterJoin, but I couldn't find such query as of now. This will also be added to this issue if I can write a query that reproduces the second problem without FullOuterJoin.

      I also found that Hive returns wrong result for query2 even when I used VectorMapJoinOperator. I am still inspecting this problem and I will add an update on it when I find out the reason.

       

      Experiment:

       

      ---- Configuration
      set hive.optimize.shared.work=false;
      
      -- Std MapJoin
      set hive.auto.convert.join=true;
      set hive.vectorized.execution.enabled=false;
      
      -- Vec MapJoin
      set hive.auto.convert.join=true;
      set hive.vectorized.execution.enabled=true;
      
      -- MergeJoin
      set hive.auto.convert.join=false;
      set hive.vectorized.execution.enabled=false;
      set hive.join.shortcut.unmatched.rows=false;
      set hive.join.emit.interval=1;
      set hive.exec.reducers.max=1;
       
      ---- Queries
      -- Query 1
      DROP TABLE IF EXISTS a;
      CREATE TABLE a (key string, value string);
      INSERT INTO a VALUES (1, 1), (1, 2), (2, 1);
      SELECT * FROM a FULL OUTER JOIN a b ON a.key = b.key AND a.key < 0;
      
      -- Query 2
      DROP TABLE IF EXISTS b;
      CREATE TABLE b (key string, value string);
      INSERT INTO b VALUES (1, 0), (1, 1);
      SELECT * FROM b FULL OUTER JOIN b a ON a.key = b.key AND a.value > 0 AND b.value > 0;

       

       

      Experiment result:

       

      -- PostgresSQL
      -- Query1
      key | value | key | value
      -----+-------+-----+-------
         1 |     1 |     |
         1 |     2 |     |
         2 |     1 |     |
           |       |   1 |     2
           |       |   1 |     1
           |       |   2 |     1
      (6 rows)
      
      -- Query2
       key | value | key | value
      -----+-------+-----+-------
         1 |     0 |     |      
         1 |     1 |   1 |     1
           |       |   1 |     0
      (3 rows)
      -- Query1 Result, current Hive
      -- Std MapJoin
      +--------+----------+--------+----------+
      | a.key  | a.value  | b.key  | b.value  |
      +--------+----------+--------+----------+
      | 2      | 1        | 2      | 1        |
      | 1      | 2        | 1      | 2        |
      | 1      | 2        | 1      | 1        |
      | 1      | 1        | 1      | 2        |
      | 1      | 1        | 1      | 1        |
      +--------+----------+--------+----------+
      -- Vec MapJoin
      +--------+----------+--------+----------+
      | a.key  | a.value  | b.key  | b.value  |
      +--------+----------+--------+----------+
      | 1      | 2        | NULL   | NULL     |
      | 1      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 2        |
      | NULL   | NULL     | 1      | 1        |
      | 2      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 2      | 1        |
      +--------+----------+--------+----------+
      -- MergeJoin
      +--------+----------+--------+----------+
      | a.key  | a.value  | b.key  | b.value  |
      +--------+----------+--------+----------+
      | 1      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 1        |
      | NULL   | NULL     | 1      | 2        |
      | 1      | 2        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 1        |
      | NULL   | NULL     | 1      | 2        |
      | 2      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 2      | 1        |
      +--------+----------+--------+----------+ 
      -- Query1 Result, Hive with HIVE-27138 patch, disable FullOuterMapJoinOptimization
      -- Std MapJoin
      +--------+----------+--------+----------+
      | a.key  | a.value  | b.key  | b.value  |
      +--------+----------+--------+----------+
      | 2      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 2      | 1        |
      | 1      | 2        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 2        |
      | NULL   | NULL     | 1      | 1        |
      | 1      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 2        |
      | NULL   | NULL     | 1      | 1        |
      +--------+----------+--------+----------+
      -- Vec MapJoin
      +--------+----------+--------+----------+
      | a.key  | a.value  | b.key  | b.value  |
      +--------+----------+--------+----------+
      | 1      | 2        | NULL   | NULL     |
      | 1      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 2        |
      | NULL   | NULL     | 1      | 1        |
      | 2      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 2      | 1        |
      +--------+----------+--------+----------+
      -- MergeJoin
      +--------+----------+--------+----------+
      | a.key  | a.value  | b.key  | b.value  |
      +--------+----------+--------+----------+
      | 1      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 1        |
      | NULL   | NULL     | 1      | 2        |
      | 1      | 2        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 1        |
      | NULL   | NULL     | 1      | 2        |
      | 2      | 1        | NULL   | NULL     |
      | NULL   | NULL     | 2      | 1        |
      +--------+----------+--------+----------+ 
      -- Query2 Result, current Hive
      -- Std MapJoin
      +--------+----------+--------+----------+
      | b.key  | b.value  | a.key  | a.value  |
      +--------+----------+--------+----------+
      | 1      | 0        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 0        |
      | NULL   | NULL     | 1      | 1        |
      | 1      | 1        | 1      | 0        |
      | 1      | 1        | 1      | 1        |
      +--------+----------+--------+----------+
      -- Vec MapJoin
      +--------+----------+--------+----------+
      | b.key  | b.value  | a.key  | a.value  |
      +--------+----------+--------+----------+
      | 1      | 0        | NULL   | NULL     |
      | 1      | 1        | 1      | 1        |
      | 1      | 1        | 1      | 0        |
      +--------+----------+--------+----------+
      -- MergeJoin
      +--------+----------+--------+----------+
      | b.key  | b.value  | a.key  | a.value  |
      +--------+----------+--------+----------+
      | 1      | 0        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 0        |
      | 1      | 1        | 1      | 1        |
      +--------+----------+--------+----------+ 
      -- Query2 Result, Hive with HIVE-27138 patch, disable FullOuterMapJoinOptimization
      -- Std MapJoin
      +--------+----------+--------+----------+
      | b.key  | b.value  | a.key  | a.value  |
      +--------+----------+--------+----------+
      | 1      | 1        | 1      | 1        |
      | 1      | 1        | 1      | 0        |
      | 1      | 0        | 1      | 1        |
      | 1      | 0        | 1      | 0        |
      +--------+----------+--------+----------+
      -- Vec MapJoin
      +--------+----------+--------+----------+
      | b.key  | b.value  | a.key  | a.value  |
      +--------+----------+--------+----------+
      | 1      | 0        | NULL   | NULL     |
      | 1      | 1        | 1      | 0        |
      | 1      | 1        | 1      | 1        |
      +--------+----------+--------+----------+
      -- MergeJoin
      +--------+----------+--------+----------+
      | b.key  | b.value  | a.key  | a.value  |
      +--------+----------+--------+----------+
      | 1      | 0        | NULL   | NULL     |
      | NULL   | NULL     | 1      | 0        |
      | 1      | 1        | 1      | 1        |
      +--------+----------+--------+----------+ 

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              seonggon Seonggon Namgung
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: