Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-988

Join strategy (broadcast vs shuffle) decision does not take memory consumption and other joins into account

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • Impala 1.2.1
    • None
    • Frontend

    Description

      The amount of available memory changes the trade-off between partitioned and shuffle join strategies: if switching to shuffle join can avoid spilling to disk, it may be worth paying the cost of the additional network transfer.

      There are two issues:
      1. Join strategy decision only takes query mem-limit into account but ignore process mem-limit.
      2. Join strategy decision does not take other joins of the same query into account. When multiple joins are present, memory consumption can be very high.

      I (tarmstrong@cloudera.com) don't think we should attempt to fix #1 - there's a phase ordering problem here - we currently choose the best-performing plan then decide how much memory to allocate in admission control based on that plan. We can't preserve that while attempting to change the plan to fit the mem_limit. That said, I think the current heuristic is a little too aggressive about picking broadcast when the right side is very large - it should probably bias more towards shuffle as the right side gets larger.

      Note that when IMPALA-3200 is completed, this shouldn't prevent the query running to completion, but still affects performance.

      Attachments

        Activity

          People

            amansinha Aman Sinha
            alan@cloudera.com Alan Choi
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: