IMPALA-4857

Handle large # of duplicate keys on build side of a spilling hash join


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version: Impala 2.9.0
    • Fix Version: None
    • Component: Backend

    Description

      Currently the hash join implementation relies on recursively repartitioning the build side until a single partition fits entirely in memory. This works well in many cases, but it fails when there are so many rows with duplicate keys that they cannot fit in the available memory: rows with the same key always hash to the same partition, so repartitioning cannot split them.

      This results in an error like: "Cannot perform hash join at node with id 6. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 6. Number of rows 275352"

      A special case of this is a Null-aware anti join with many NULLs on the build side.

      This error often occurs because of a suboptimal query or plan that produces a large number of duplicate values on one side of the join. Changing the join operator to spill in many of these cases would allow the query to run to completion, but very slowly, since it would need to do a quadratic pairwise comparison of both sides of the join.

            People

              Kurt Deschler (kdeschle)
              Tim Armstrong (tarmstrong)
              Votes: 0
              Watchers: 9
