  1. Apache Tez
  2. TEZ-1152

Optimize broadcast join for scalability

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None

      Description

      Large queries that use broadcast shuffle run into two main issues:

      1. Many tasks contact the same node to download shuffle data, so a single machine is frequently overloaded with requests.

      2. Tasks that belong to the same job and run on the same machine each download the broadcast shuffle data redundantly. If the data were copied to temporary storage or ramfs, the other tasks on that machine could read the local copy instead. This optimization would help most when multiple queries run in parallel on the cluster.
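
      The local-copy idea in issue 2 could look roughly like the sketch below. It is not Tez code: `fetch_broadcast`, the `download` callback, and the cache directory layout are all hypothetical names chosen for illustration, and the key is assumed to be a plain file name. The first task on a machine downloads the data and publishes it with an atomic rename; later tasks find the file and skip the remote fetch.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def fetch_broadcast(cache_dir: Path, key: str, download: Callable[[], bytes]) -> Path:
    """Return a node-local path holding the broadcast data for `key`.

    `download` is a hypothetical callback that pulls the data from the
    remote shuffle source; it is only invoked on a cache miss.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / key
    if cached.exists():
        # Another task on this machine already fetched the data.
        return cached

    # Write to a temp file in the same directory, then publish it with an
    # atomic rename so readers never observe a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, prefix=key + ".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(download())
        os.replace(tmp_name, cached)
    except BaseException:
        # Clean up the partial temp file before propagating the error.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return cached
```

      Two tasks that miss the cache at the same moment may still both download in this sketch; the rename keeps the result correct, but avoiding the duplicate fetch entirely would need a per-key lock. Pointing `cache_dir` at a ramfs or tmpfs mount would give the in-memory variant mentioned above.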

            People

            • Assignee: Unassigned
            • Reporter: Rajesh Balamohan (rajesh.balamohan)
            • Votes: 0
            • Watchers: 4
