Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-2104

A CrossProductEdge which produces synthetic cross-product parallelism

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.9.1
    • None

    Description

      Instead of producing duplicate data for the synthetic cross-product, to fit into partitions, the amount of net IO can be vastly reduced by a special purpose cross-product data movement edge.

      The Shuffle edge routes each partition's output to a single reducer, while the cross-product edge routes it into a matrix of reducers without actually duplicating the disk data.

      A partitioning scheme with 3 partitions on the lhs and rhs of a join operation can be routed into 9 reducers by performing a cross-product similar to

      (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...]

      This turns a single task cross-product model into a distributed cross product.

      Attachments

        1. Cross product edge design.pdf
          293 kB
          Zhiyuan Yang
        2. Cartesian product edge design.2.pdf
          284 kB
          Zhiyuan Yang

        Issue Links

          Activity

            People

              zhiyuany Zhiyuan Yang
              gopalv Gopal Vijayaraghavan
              Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: