Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-9146

Query with left joins produces wrong result when join condition is written in different order

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.13.1
    • 0.14.0
    • Logical Optimizer
    • None

    Description

      I have two queries which should be equal (I only swap two join conditions) but they are not. They are simplest queries I could produce to reproduce bug.

      I have two simple tables:

      desc kgorlo_comm;

      col_name data_type comment
      id bigint  
      dest_id bigint  

      desc kgorlo_log;

      col_name data_type comment
      id bigint  
      dest_id bigint  
      tstamp bigint  

      With data:

      select * from kgorlo_comm;

      kgorlo_comm.id kgorlo_comm.dest_id
      1 2
      2 1
      1 3
      2 3
      3 5
      4 5

      select * from kgorlo_log;

      kgorlo_log.id kgorlo_log.dest_id kgorlo_log.tstamp
      1 2 0
      1 3 0
      1 5 0
      3 1 0

      And when I run this query (query no. 1):

      select log.id, log.dest_id, com1.msgs, com2.msgs from kgorlo_log log
      left outer join (select id, dest_id, count( * ) as msgs from kgorlo_comm group by id, dest_id)com1 on com1.id=log.id and com1.dest_id=log.dest_id
      left outer join (select id, dest_id, count( * ) as msgs from kgorlo_comm group by id, dest_id)com2 on com2.dest_id=log.id and com2.id=log.dest_id;

      I get result (which is correct):

      log.id log.dest_id com1.msgs com2.msgs
      1 2 1 1
      1 3 1 NULL
      1 5 NULL NULL
      3 1 NULL 1

      But when I run second query (query no. 2):

      select log.id, log.dest_id, com1.msgs, com2.msgs from kgorlo_log log
      left outer join (select id, dest_id, count( * ) as msgs from kgorlo_comm group by id, dest_id)com1 on com1.id=log.id and com1.dest_id=log.dest_id
      left outer join (select id, dest_id, count( * ) as msgs from kgorlo_comm group by id, dest_id)com2 on com2.id=log.dest_id and com2.dest_id=log.id;

      I get different (and bad, in my opinion) result:

      log.id log.dest_id com1.msgs com2.msgs
      1 2 1 1
      1 3 1 1
      1 5 NULL NULL
      3 1 NULL NULL

      Query no. 1 and query no. 2 are different in only one place, it is second join condition:
      bf. com2.dest_id=log.id and com2.id=log.dest_id
      vs
      bf. com2.id=log.dest_id and com2.dest_id=log.id

      which in my opinion are equal.

      Explains for both queries are of course slightly different (columns are swapped) and they are here:

      https://gist.github.com/kgs/399ad7ca2c481bd2c018 (query no. 1, good)
      https://gist.github.com/kgs/bfb3216f0f1fbc28037e (query no. 2, bad)

      Attachments

        Issue Links

          Activity

            People

              ashutoshc Ashutosh Chauhan
              kgs Kamil Gorlo
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: