Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2502

Use equivalence classes when determining where to fragment a plan

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • Impala 2.3.0
    • impala 2.5.1
    • Frontend

    Description

      Equivalence classes can be used to avoid unnecessary exchanges when partitioning/fragmenting a plan. For example in TPC-H Q-13 the current plan includes

      F03:PLAN FRAGMENT [HASH(c_custkey)]
        DATASTREAM SINK [FRAGMENT=F04, EXCHANGE=10, HASH(c_count)]
        04:AGGREGATE
        |  output: count(*)
        |  group by: count(o_orderkey)
        |  hosts=10 per-host-mem=891.04MB
        |  tuple-ids=4 row-size=16B cardinality=53086384
        |
        09:AGGREGATE [FINALIZE]
        |  output: count:merge(o_orderkey)
        |  group by: c_custkey
        |  hosts=10 per-host-mem=89.10MB
        |  tuple-ids=2 row-size=16B cardinality=53086384
        |
        08:EXCHANGE [HASH(c_custkey)]
           hosts=10 per-host-mem=0B
           tuple-ids=2 row-size=16B cardinality=53086384
      
      F02:PLAN FRAGMENT [HASH(o_custkey)]
        DATASTREAM SINK [FRAGMENT=F03, EXCHANGE=08, HASH(c_custkey)]
        03:AGGREGATE
        |  output: count(o_orderkey)
        |  group by: c_custkey
        |  hosts=10 per-host-mem=891.04MB
        |  tuple-ids=2 row-size=16B cardinality=53086384
        |
        02:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]
        |  hash predicates: o_custkey = c_custkey
        |  hosts=10 per-host-mem=37.77MB
        |  tuple-ids=1N,0 row-size=88B cardinality=405000000
        |
        |--07:EXCHANGE [HASH(c_custkey)]
        |     hosts=10 per-host-mem=0B
        |     tuple-ids=0 row-size=8B cardinality=45000000
        |
        06:EXCHANGE [HASH(o_custkey)]
           hosts=10 per-host-mem=0B
           tuple-ids=1 row-size=80B cardinality=405000000
      

      These two fragments have equivalent input partitioning since o_custkey = c_custkey. The fragment below shows what the might be done instead.

      F02:PLAN FRAGMENT [HASH(c_custkey)]
        DATASTREAM SINK [FRAGMENT=F04, EXCHANGE=10, HASH(c_count)]
        04:AGGREGATE
        |  output: count(*)
        |  group by: count(o_orderkey)
        |  hosts=10 per-host-mem=891.04MB
        |  tuple-ids=4 row-size=16B cardinality=53086384
        |
        09:AGGREGATE [FINALIZE]
        |  output: count(o_orderkey)
        |  group by: c_custkey
        |  hosts=10 per-host-mem=891.04MB
        |  tuple-ids=2 row-size=16B cardinality=53086384
        |
        02:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]
        |  hash predicates: o_custkey = c_custkey
        |  hosts=10 per-host-mem=37.77MB
        |  tuple-ids=1N,0 row-size=88B cardinality=405000000
        |
        |--07:EXCHANGE [HASH(c_custkey)]
        |     hosts=10 per-host-mem=0B
        |     tuple-ids=0 row-size=8B cardinality=45000000
        |
        06:EXCHANGE [HASH(o_custkey)]
           hosts=10 per-host-mem=0B
           tuple-ids=1 row-size=80B cardinality=405000000
      

      Attachments

        Activity

          People

            tarmstrong Tim Armstrong
            caseyc casey
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: