Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: tez
    • Labels:
      None

      Description

      The replicate join input that was broadcast to the union vertex now needs to be broadcast to all of the union predecessors. So we need to:

      • Create edges from the replicate join input to all the union predecessors.
      • Change the replicate join input to write to multiple outputs.

      This can be further optimized by using a shared edge, which is yet to be implemented in Tez (TEZ-391).
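
The edge rewrite described above can be sketched as follows. This is only an illustrative model on a plain edge list, not Pig or Tez code: the function name `rebroadcast_to_union_predecessors` and the tuple representation of edges are assumptions made for the example, and the sketch does not model how the union vertex itself is handled afterwards.

```python
def rebroadcast_to_union_predecessors(edges, union_vertex, broadcast_sources):
    """Replace each broadcast edge into union_vertex with one edge per union predecessor.

    edges: list of (source, destination) vertex-name pairs (hypothetical representation).
    broadcast_sources: set of vertices whose output is the replicated (small) table.
    """
    # Union predecessors are the non-broadcast inputs of the union vertex.
    preds = [src for (src, dst) in edges
             if dst == union_vertex and src not in broadcast_sources]
    rewritten = []
    for src, dst in edges:
        if dst == union_vertex and src in broadcast_sources:
            # One output per predecessor, so the small table is written len(preds) times.
            rewritten.extend((src, p) for p in preds)
        else:
            rewritten.append((src, dst))
    return rewritten


# v2 and v3 feed the union vertex v4; v1 holds the replicated join table.
edges = [("v2", "v4"), ("v3", "v4"), ("v1", "v4")]
new_edges = rebroadcast_to_union_predecessors(edges, "v4", {"v1"})
print(new_edges)
```

In this toy model, `v1` ends up with two outgoing edges instead of one, which is the "write to multiple outputs" cost that a shared edge would avoid.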

      1. PIG-3856-1.patch
        48 kB
        Rohini Palaniswamy

        Issue Links

          Activity

          Rohini Palaniswamy created issue -
          Rohini Palaniswamy made changes -
          Field Original Value New Value
          Fix Version/s tez-branch [ 12324968 ]
          Rohini Palaniswamy made changes -
          Link This issue requires TEZ-391 [ TEZ-391 ]
          Rohini Palaniswamy added a comment -

          The attached patch has the required changes mentioned in the description, except for the further optimization with the Tez shared edge. Cheolsoo ran one of the Netflix production scripts with the patch, but found that performance degraded a bit. This is most likely due to writing the same replicated join table multiple times to different outputs. So I have just uploaded the patch for now. I will make the required changes once shared edges are available and then have this committed.

          I also realized that vertex caching is applicable to only one vertex. In this case the same replicated join table could be cached for more than one vertex. That is a candidate for another feature request in Tez.

          Rohini Palaniswamy made changes -
          Attachment PIG-3856-1.patch [ 12646953 ]
          Rohini Palaniswamy made changes -
          Link This issue requires TEZ-1153 [ TEZ-1153 ]
          Daniel Dai made changes -
          Component/s tez [ 12321016 ]
          Daniel Dai made changes -
          Fix Version/s 0.14.0 [ 12326954 ]
          Fix Version/s tez-branch [ 12324968 ]
          Rohini Palaniswamy made changes -
          Fix Version/s 0.14.0 [ 12326954 ]
          Jeff Zhang added a comment -

          Rohini Palaniswamy, I went through your patch quickly and have two questions:

          • Skew join is also impacted (TEZC-Union-6). Is this expected?
          • The new DAG for TEZ-Union-4 is as follows:
                 v1
                /  \
              v2    v3
                \  /
                 v4

            But I think v4 is not necessary; just grouping v2 and v3 together as one vertex group should be enough.

          Rohini Palaniswamy added a comment -

          "Skew join also is impacted (TEZC-Union-6) Is this expected ?"

          Yes. Similar to the small table in replicate join, the SkewedJoin sample that was broadcast to the union vertex now needs to be broadcast to all the union predecessors.

          "The new DAG for TEZ-Union-4 is as following. But I think v4 is not necessary, just group v2 and v3 together as one vertex group should be enough."

          Currently, without this patch, the plan is v2(load), v3(load), v1(small table) -> v4(union vertex). With this patch, the plan should be v1 -> v2, v3, where v2 and v3 are in one vertex group for the union and there is no v4. But the v1 output is written twice: once for v2 and once for v3. See the plan in TEZC-Union-4.gld.

          Tez vertex scope-40	->	Tez vertex scope-34,Tez vertex scope-35,
          Tez vertex scope-34	->	Tez vertex scope-41,
          Tez vertex scope-35	->	Tez vertex scope-41,
          Tez vertex scope-41
          

          Tez vertex scope-41 is the vertex group, and there are only v1, v2 and v3. Maybe you are looking at TEZC-Union-4-OPTOFF.gld, where the UnionOptimizer is turned off.

          Once the shared edge is done, the patch should be rewritten to make v1 write the small table once but send it to both v2 and v3. That will have to be a new optimizer that runs before UnionOptimizer. UnionOptimizer creates vertex groups for consuming input. The new optimizer will create vertex groups for sending outputs.
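
To make the difference concrete, here is a rough sketch of how many outputs v1 would have to write with and without a shared edge. It is a toy model only: shared edges are not implemented in Tez at the time of this discussion (TEZ-391), and `plan_outputs`, the `vertex_groups` mapping, and the output naming are hypothetical names invented for the illustration, not Pig or Tez APIs.

```python
def plan_outputs(edges, producer, vertex_groups, shared_edge_available):
    """Return a list of (output_name, consumers) that the producer must write.

    edges: list of (source, destination) vertex-name pairs.
    vertex_groups: dict mapping a group name to the set of its member vertices.
    """
    consumers = [dst for src, dst in edges if src == producer]
    if not shared_edge_available:
        # Current patch behaviour: one separate output per consumer.
        return [(f"{producer}->{c}", [c]) for c in consumers]

    outputs = []
    remaining = list(consumers)
    for group_name, members in vertex_groups.items():
        grouped = [c for c in remaining if c in members]
        if len(grouped) > 1:
            # Hypothetical shared edge: written once, read by every group member.
            outputs.append((f"{producer}->{group_name}", grouped))
            remaining = [c for c in remaining if c not in grouped]
    # Consumers outside any group still get their own output.
    outputs.extend((f"{producer}->{c}", [c]) for c in remaining)
    return outputs


edges = [("v1", "v2"), ("v1", "v3")]
groups = {"union_group": {"v2", "v3"}}
print(len(plan_outputs(edges, "v1", groups, shared_edge_available=False)))
print(len(plan_outputs(edges, "v1", groups, shared_edge_available=True)))
```

Under these assumptions the small table is written twice in the first case and once in the second, which is the saving the follow-up optimizer is meant to achieve.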

          Jeff Zhang added a comment -

          Rohini Palaniswamy, thanks for your explanation. Renaming Tez vertex scope-41 to Tez vertex group scope-41 might be better; otherwise it is a little confusing. I am working on the shared edge in Tez and will let you know once it is ready to try.

          Rohini Palaniswamy added a comment -

          No problem. We actually print it as Tez vertex group in Pig 0.14. But this patch was done before that change went in.


            People

            • Assignee: Unassigned
            • Reporter: Rohini Palaniswamy
            • Votes: 0
            • Watchers: 2
