IMPALA-4857

Handle large # of duplicate keys on build side of a spilling hash join


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: None
    • Component/s: Backend

      Description

      Currently the hash join implementation relies on recursively repartitioning the build side until a single partition fits entirely in memory. This works well in many cases, but fails when there are so many rows with the same key that they cannot fit in the available memory: rows with identical keys hash to the same partition at every level, so repartitioning never shrinks that partition.

      This results in an error like: "Cannot perform hash join at node with id 6. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 6. Number of rows 275352"

      A special case of this is a Null-aware anti join with many NULLs on the build side.

      This error often occurs because of a suboptimal query or plan that has a large number of duplicate values on one side of the join. Changing the join operator to spill in many of these cases would let the query run to completion, but very slowly, since it must perform a quadratic pairwise comparison of the two sides of the join.
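
      One way to picture the proposed spilling behavior is a block nested-loop fallback: hold one memory-sized chunk of the oversized build partition at a time and rescan the probe side once per chunk. This is a sketch of that general technique, not Impala's actual implementation; ChunkedJoin and its parameters are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical block nested-loop fallback for a build partition that cannot
// fit in memory even after repartitioning. Only build[begin, end) is held
// "in memory" at once; the probe side is streamed in full for every chunk,
// which is why the cost is quadratic in the worst case.
std::vector<std::pair<int64_t, int64_t>> ChunkedJoin(
    const std::vector<int64_t>& build, const std::vector<int64_t>& probe,
    std::size_t chunk_rows) {
  std::vector<std::pair<int64_t, int64_t>> out;
  for (std::size_t begin = 0; begin < build.size(); begin += chunk_rows) {
    std::size_t end = std::min(begin + chunk_rows, build.size());
    // Rescan the whole probe side against this chunk of the build side.
    for (int64_t p : probe) {
      for (std::size_t i = begin; i < end; ++i) {
        if (build[i] == p) out.emplace_back(build[i], p);
      }
    }
  }
  return out;
}
```

      The trade-off matches the description: the query always completes, but a build side full of duplicates pays one full probe-side scan per chunk instead of a single hash lookup per probe row.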


              People

              • Assignee: Unassigned
              • Reporter: tarmstrong Tim Armstrong
              • Votes: 0
              • Watchers: 6
