Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-1152

Optimize broadcast join for scalability

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None

    Description

      Two main issues for large queries using broadcast shuffle

      1. Lots of tasks communicate to same node for downloading shuffle data. So most of the time, single machine will be overloaded with requests.

      2. Tasks pertaining to same job (in the same machine) downloads broadcast shuffle data redundantly. If the data can be copied to temp storage or ramfs, other tasks running in the same machine can refer to the local copy. Optimizing this would help when running multiple queries in parallel in the cluster.

      Attachments

        There are no Sub-Tasks for this issue.

        Activity

          People

            Unassigned Unassigned
            rajesh.balamohan Rajesh Balamohan
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: