Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2440

Full outer join using non-partitioned HJ can incorrectly produce extra row of nulls

    XMLWordPrintableJSON

Details

    Description

      When a full outer join on the old (non-partitioned) HashJoinNode, if any join fragment has 0 build rows and 0 probe rows an extra null row will be produced;

      In the following example running on a mini cluster with 3 impalads, the following query produces incorrect results because one join fragment instance has 0 build rows and 0 probe rows, and it produces an extra row with null tuples:

      Query: select a.id, b.id
      from
      (select id
      from functional.alltypestiny where id between 0 and 2) a
      full outer join
      (select id
      from functional.alltypestiny where id between 1 and 3) b
      on (a.id = b.id)
      +------+------+
      | id   | id   |
      +------+------+
      | NULL | NULL |
      | 0    | NULL |
      | 1    | 1    |
      | 2    | 2    |
      | NULL | 3    |
      +------+------+
      Fetched 5 row(s) in 0.96s
      

      The first result (NULL, NULL) should not be there.

      Query: explain select a.id, b.id
      from
      (select id
      from functional.alltypestiny where id between 0 and 2) a
      full outer join
      (select id
      from functional.alltypestiny where id between 1 and 3) b
      on (a.id = b.id)
      +--------------------------------------------------------------+
      | Explain String                                               |
      +--------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=32.00MB VCores=2     |
      |                                                              |
      | F03:PLAN FRAGMENT [UNPARTITIONED]                            |
      |   05:EXCHANGE [UNPARTITIONED]                                |
      |      hosts=3 per-host-mem=unavailable                        |
      |      tuple-ids=0N,2N row-size=8B cardinality=2               |
      |                                                              |
      | F02:PLAN FRAGMENT [HASH(id)]                                 |
      |   DATASTREAM SINK [FRAGMENT=F03, EXCHANGE=05, UNPARTITIONED] |
      |   02:HASH JOIN [FULL OUTER JOIN, PARTITIONED]                |
      |   |  hash predicates: id = id                                |
      |   |  hosts=3 per-host-mem=1B                                 |
      |   |  tuple-ids=0N,2N row-size=8B cardinality=2               |
      |   |                                                          |
      |   |--04:EXCHANGE [HASH(id)]                                  |
      |   |     hosts=3 per-host-mem=0B                              |
      |   |     tuple-ids=2 row-size=4B cardinality=1                |
      |   |                                                          |
      |   03:EXCHANGE [HASH(id)]                                     |
      |      hosts=3 per-host-mem=0B                                 |
      |      tuple-ids=0 row-size=4B cardinality=1                   |
      |                                                              |
      | F01:PLAN FRAGMENT [RANDOM]                                   |
      |   DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=04, HASH(id)]      |
      |   01:SCAN HDFS [functional.alltypestiny, RANDOM]             |
      |      partitions=4/4 files=4 size=460B                        |
      |      predicates: id >= 1, id <= 3                            |
      |      table stats: 8 rows total                               |
      |      column stats: all                                       |
      |      hosts=3 per-host-mem=32.00MB                            |
      |      tuple-ids=2 row-size=4B cardinality=1                   |
      |                                                              |
      | F00:PLAN FRAGMENT [RANDOM]                                   |
      |   DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=03, HASH(id)]      |
      |   00:SCAN HDFS [functional.alltypestiny, RANDOM]             |
      |      partitions=4/4 files=4 size=460B                        |
      |      predicates: id >= 0, id <= 2                            |
      |      table stats: 8 rows total                               |
      |      column stats: all                                       |
      |      hosts=3 per-host-mem=32.00MB                            |
      |      tuple-ids=0 row-size=4B cardinality=1                   |
      +--------------------------------------------------------------+
      Fetched 41 row(s) in 4.78s
      

      Attachments

        Activity

          People

            mjacobs Matthew Jacobs
            mjacobs Matthew Jacobs
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: