  1. Hive
  2. HIVE-16499

[Tez] CommonMergeJoin Operator is taking longer to join rows as compared to MR

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.0, 1.3.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      The issue can be reproduced with a reduce-side join. Apply the patch available in HIVE-16498 first; otherwise the time spent reading useless data will mask the slow join described here.
      The data for large_table is generated by the following shell script, and the table is then created over the resulting file `large.txt`.

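      # Requires bash. Writes 20,000,000 lines of the form "i,j" (i in 1..1000000, j in 1..20).
      # The file is opened once and overwritten, so re-running the script does not duplicate rows.
      for (( j = 1; j <= 20; j++ ))
      do
        for (( i = 1; i <= 1000000; i++ ))
        do
          echo "$i,$j"
        done
      done > large.txt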
      
      create external table large_table ( i int, j int) row format delimited fields terminated by ',' location "hdfs://<some-hdfs-location>";
      
      set hive.auto.convert.join=false; -- So that reduce side join is used instead of MapJoin
      
      select * from large_table a join large_table b on a.j = b.j limit 100;
      

      This issue is different from HIVE-16498: here Tez spends the time in the join operator rather than in reading extra data.
      With the patch for HIVE-16498 applied, the above join query takes around 30-40 minutes on Tez, compared to about 5 minutes on MR.
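
For a sense of scale, here is a minimal bash sketch that is not part of the original report. It parameterizes the generator above so a smaller file can be produced for a quick check, and it computes how many rows the self-join on `j` would produce without the `limit`. The function names `generate_rows` and `join_rows` are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "i,j" lines for j in 1..j_max and i in 1..i_max.
# generate_rows 20 1000000 reproduces the contents of large.txt from the report.
generate_rows() {
  local j_max=$1 i_max=$2 i j
  for (( j = 1; j <= j_max; j++ )); do
    for (( i = 1; i <= i_max; i++ )); do
      echo "$i,$j"
    done
  done
}

# Hypothetical helper: row count of the self-join on j without a limit.
# Each of the j_max key values has i_max rows on both sides,
# so each key contributes i_max * i_max matching pairs.
join_rows() {
  local j_max=$1 i_max=$2
  echo $(( j_max * i_max * i_max ))
}

# Example: a small file for a quick test, and the unbounded join size
# for the data set described in the report.
# generate_rows 2 1000 > small.txt
# join_rows 20 1000000
```

With the report's parameters, each of the 20 values of `j` matches 1,000,000 rows on each side of the join, so the join without the `limit` would yield 20 × 10^12 rows. The query only returns 100 of them.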

          People

            Assignee: Unassigned
            Reporter: Adesh Kumar Rao (adeshrao)
            Votes: 0
            Watchers: 3
