Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-9083

Explore co-partitioning join strategies

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 3.4.0
    • Fix Version/s: None
    • Component/s: Frontend
    • Labels:
      None
    • Epic Color:
      ghx-label-5

      Description

      The idea of co-partitioning join is not new and it has been implemented in other systems (see link and link). With HDFS, we may not have as much control over the block locations so it's not as easy to make assumption about the co-location of partitions of multiple tables.

      With remote reads (e.g. S3), we don't have this constraint anymore as all data are remote. So, if we are joining on the partition key of two tables, we can create a scan range schedule to co-locate partitions with the same partition values on the same executor and do the join locally to avoid the subsequent shuffling step.

      The potential downside with this idea is that if there are skews in the data distribution, we may overwhelm a few of the nodes in both the scans and join operators. Previously, if we have skew in data distribution, we may still overwhelm the join operator but the scans can be better parallelized. cc'ing David Rorke

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              kwho Michael Ho
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: