Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10287

Distribution strategy is sub-optimal for certain queries

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 3.4.0
    • Impala 4.0.0
    • Frontend
    • None
    • ghx-label-13

    Description

      I ran a simplified query (extracted from q78 of TPC-DS) on a 600GB dataset on an 8 node cluster. I forced the distribution strategy for the left outer join and compared Broadcast vs Hash Partition for different values of mt_dop. The example query and results are shown below (elapsed times are in seconds):

      Query (with shuffle or broadcast hint):
      select count(*)
         from store_sales
         left join [shuffle] store_returns on sr_ticket_number=ss_ticket_number 
               and ss_item_sk=sr_item_sk
         join date_dim on ss_sold_date_sk = d_date_sk
         where sr_ticket_number is null
         and d_year=2002;
      
      mt_dop Broadcast Partition
      1 45 15
      2 37 9
      4 33 5
      8 31 4
      12 31 4

      Given the nearly 7.5x speedup for partition distribution at mt_dop = 12 (which is the default), it indicates that the cost formula comparing the broadcast vs partition needs to be modified to take into account the mt_dop.

      Attachments

        Issue Links

          Activity

            People

              amansinha Aman Sinha
              amansinha Aman Sinha
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: