Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24199

Incorrect result when subquey in exists contains limit

    XMLWordPrintableJSON

Details

    Description

      create table web_sales (ws_order_number int, ws_warehouse_sk int) stored as orc;
      
      insert into web_sales values
      (1, 1),
      (1, 2),
      (2, 1),
      (2, 2);
      
      select * from web_sales ws1
      where exists (select 1 from web_sales ws2 where ws1.ws_order_number = ws2.ws_order_number limit 1);
      1	1
      1	2
      
      CBO PLAN:
      HiveSemiJoin(condition=[=($0, $2)], joinType=[semi])
        HiveProject(ws_order_number=[$0], ws_warehouse_sk=[$1])
          HiveFilter(condition=[IS NOT NULL($0)])
            HiveTableScan(table=[[default, web_sales]], table:alias=[ws1])
        HiveProject(ws_order_number=[$0])
          HiveSortLimit(fetch=[1])          <-- This shouldn't be added
            HiveProject(ws_order_number=[$0])
              HiveFilter(condition=[IS NOT NULL($0)])
                HiveTableScan(table=[[default, web_sales]], table:alias=[ws2])
      

      Limit n on the right side of the join reduces the result set coming from the right to only n record hence not all the ws_order_number values are included which leads to correctness issue.

      Attachments

        Issue Links

          Activity

            People

              kkasa Krisztian Kasa
              kkasa Krisztian Kasa
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m