Hive / HIVE-11576

Data loss in MapJoin

Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      In the following query (TPC-H query 4):

      query4.sql
      create table q4_result as 
      select 
      o_orderpriority, 
      count(*) as order_count 
      from 
      orders o 
      join 
      ( 
      select 
      distinct l_orderkey 
      from 
      ( 
      select 
      * 
      from 
      lineitem 
      where 
      l_commitdate < l_receiptdate 
      ) tab1 
      ) tab2 
      on tab2.l_orderkey = o.o_orderkey 
      where 
      o.o_orderdate >= '1993-07-01' and o.o_orderdate < '1993-10-01' 
      group by 
      o_orderpriority 
      order by 
      o_orderpriority;
      

      The query loses data when MapJoin is enabled. Both sides of the join produce the expected output, but some matching rows are not joined. After disabling automatic join conversion (hive.auto.convert.join), the problem goes away.
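One way to confirm the loss is to run the join's row count once with automatic MapJoin conversion on and once with it off, then compare the two results. The sketch below assumes the tables and columns are exactly those in query4.sql. `hive.auto.convert.join` is the standard Hive property for this conversion. That the two counts differ is what this report describes, not something verified here.

```sql
-- Sketch: count joined rows with automatic MapJoin conversion enabled.
-- Whether a MapJoin is actually chosen also depends on table sizes and
-- the configured size thresholds, so check the plan with EXPLAIN first.
set hive.auto.convert.join=true;
select count(*)
from orders o
join (
  select distinct l_orderkey
  from lineitem
  where l_commitdate < l_receiptdate
) tab2
on tab2.l_orderkey = o.o_orderkey
where o.o_orderdate >= '1993-07-01' and o.o_orderdate < '1993-10-01';

-- Same count with conversion disabled (the workaround from the report).
set hive.auto.convert.join=false;
select count(*)
from orders o
join (
  select distinct l_orderkey
  from lineitem
  where l_commitdate < l_receiptdate
) tab2
on tab2.l_orderkey = o.o_orderkey
where o.o_orderdate >= '1993-07-01' and o.o_orderdate < '1993-10-01';
```

If the first count is lower than the second, the missing rows are being dropped in the MapJoin path rather than in either input.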

      Context:
      l_orderkey and o_orderkey are both bigint.
      Vectorized execution is enabled.
      The execution engine is Tez.
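The context above corresponds roughly to the session settings below. This is a sketch: the property names are standard Hive configuration keys, but the reporter's actual configuration is not included in the report, so the values are assumptions based on the description.

```sql
-- Assumed session settings matching the reported context.
set hive.execution.engine=tez;               -- execution engine is Tez
set hive.vectorized.execution.enabled=true;  -- vectorized execution enabled
set hive.auto.convert.join=true;             -- automatic MapJoin conversion on (problem case)
```

Turning off `hive.vectorized.execution.enabled` while leaving the other two unchanged would help show whether the vectorized MapJoin path is involved.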

          People

            Assignee: Matt McCline (mmccline)
            Reporter: Ted Xu (tedxu)
            Votes: 0
            Watchers: 5
