Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2203

CTAS and IAS produce different results vs SELECT when using FULL JOIN

    XMLWordPrintableJSON

Details

    Description

      where CTAS and IAS produce different results as a plain SELECT when using FULL JOINs and static value columns.

      It's a regression. Seen in Impala 2.1.3+ (starting in CDH 5.3.5) but not earlier version.

      Performing the same steps in Hive produce expected results. Also verified expected results in postgresql.

      Example:

      create table raw (id STRING, val STRING);
      insert into raw values ('user1','one'), ('user2','two');
      select * from raw;
      +-------+-----+
      | id | val |
      +-------+-----+
      | user1 | one |
      | user2 | two |
      +-------+-----+
      
      select a.id as a_id, b.id as b_id, one, two from ( select id, 'Y' as one from raw where val = 'one' group by id ) a full join ( select id, 'Y' as two from raw where val = 'two' group by id ) b on a.id = b.id;
      +-------+-------+------+------+
      | a_id | b_id | one | two |
      +-------+-------+------+------+
      | user1 | NULL | Y | NULL |
      | NULL | user2 | NULL | Y |
      +-------+-------+------+------+
      
      create table test1 as select a.id as a_id, b.id as b_id, one, two from ( select id, 'Y' as one from raw where val = 'one' group by id ) a full join ( select id, 'Y' as two from raw where val = 'two' group by id ) b on a.id = b.id;
      
      select * from test1;
      +-------+-------+-----+-----+
      | a_id | b_id | one | two |
      +-------+-------+-----+-----+
      | user1 | NULL | Y | Y |
      | NULL | user2 | Y | Y |
      +-------+-------+-----+-----+
      

      Testing using INSERT also produces the same incorrect result.

      Workaround
      To produce correct results, you can rewrite the with a UNION ALL to force the inline-view results to be materialized.
      Before CTAS with incorrect results:

      create table test1 as select a.id as a_id, b.id as b_id, one, two from
        (select id, 'Y' as one from raw where val = 'one' group by id) a
      full join
        (select id, 'Y' as two from raw where val = 'two' group by id) b
      on a.id = b.id;
      

      After CTAS with correct results:

      create table test1 as select a.id as a_id, b.id as b_id, one, two from
        (select id, 'Y' as one from raw where val = 'one' group by id
         union all <-- This is the workaround
         select NULL, NULL from raw where false) a <-- This is the workaround
      full join
        (select id, 'Y' as two from raw where val = 'two' group by id
         union all <-- This is the workaround 
         select NULL, NULL from raw where false) b <-- This is the workaround
      on a.id = b.id;
      

      Attachments

        Activity

          People

            alex.behm Alexander Behm
            jyu@cloudera.com Juan Yu
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: