Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-2104

A CrossProductEdge which produces synthetic cross-product parallelism

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.9.1
    • None

    Description

      Instead of producing duplicate data for the synthetic cross-product, to fit into partitions, the amount of net IO can be vastly reduced by a special purpose cross-product data movement edge.

      The Shuffle edge routes each partition's output to a single reducer, while the cross-product edge routes it into a matrix of reducers without actually duplicating the disk data.

      A partitioning scheme with 3 partitions on the lhs and rhs of a join operation can be routed into 9 reducers by performing a cross-product similar to

      (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...]

      This turns a single task cross-product model into a distributed cross product.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            zhiyuany Zhiyuan Yang
            gopalv Gopal Vijayaraghavan
            Votes:
            1 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment