Skew join is also impacted (TEZC-Union-6). Is this expected?
Yes. As with the small table in a replicated join, the SkewedJoin sample that was broadcast to the union vertex now needs to be broadcast to all of the union's predecessors.
The new DAG for TEZC-Union-4 is as follows. But I think v4 is not necessary; grouping v2 and v3 together as one vertex group should be enough.
Currently, without this patch, the plan is v2(load), v3(load), v1(small table) -> v4(union vertex). With this patch, the plan should be v1 -> v2, v3, where v2 and v3 are in one vertex group for the union and there is no v4. However, v1's output is written twice: once for v2 and once for v3. See the plan for TEZC-Union-4.gld.
Tez vertex scope-40 -> Tez vertex scope-34,Tez vertex scope-35,
Tez vertex scope-34 -> Tez vertex scope-41,
Tez vertex scope-35 -> Tez vertex scope-41,
Tez vertex scope-41
Tez vertex scope-41 is the vertex group, and there are only v1, v2 and v3. Maybe you are looking at TEZC-Union-4-OPTOFF.gld, where the UnionOptimizer is turned off.
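To make the before/after plans concrete, here is a minimal sketch of the rewrite on a toy edge list. It is not Pig's UnionOptimizer code: the function name `remove_union_vertex`, the `(source, target, kind)` tuples and the `"union"`/`"broadcast"` labels are all made up for illustration, and edges leaving the union vertex are not modelled.

```python
# Toy model only: names and edge kinds are hypothetical, not Pig/Tez classes.
def remove_union_vertex(edges, union):
    """Drop the union vertex and re-point its broadcast inputs at its members.

    edges is a list of (source, target, kind) tuples, where kind is
    "union" (a load feeding the union) or "broadcast" (the small table).
    Returns the rewritten edge list and the members of the vertex group.
    """
    # The union's load predecessors become the vertex group.
    members = [src for src, dst, kind in edges if dst == union and kind == "union"]
    rewritten = []
    for src, dst, kind in edges:
        if dst != union:
            rewritten.append((src, dst, kind))
        elif kind == "broadcast":
            # One broadcast edge per group member: the source's output
            # is now written once for each of them.
            rewritten.extend((src, member, kind) for member in members)
        # "union" edges into the removed vertex are simply dropped.
    return rewritten, members


# Plan without the patch: v2, v3 and v1 all feed the union vertex v4.
before = [("v2", "v4", "union"), ("v3", "v4", "union"), ("v1", "v4", "broadcast")]
after, group = remove_union_vertex(before, "v4")
```

In this toy model `after` holds the two broadcast edges v1 -> v2 and v1 -> v3, and `group` holds v2 and v3, which corresponds to the shape of the TEZC-Union-4.gld plan quoted above.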
Once the shared edge work is done, the patch should be rewritten so that v1 writes the small table once but sends it to both v2 and v3. That will have to be a new optimizer that runs before the UnionOptimizer. The UnionOptimizer creates vertex groups for consuming input; the new optimizer will create vertex groups for sending output.
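The difference in write counts can be sketched as follows. This assumes, as the comment above describes, that a shared edge would let a single written output feed several consumers; `plan_writes` and its `shared_edge` flag are hypothetical names and do not correspond to any Tez or Pig API.

```python
from collections import defaultdict


def plan_writes(edges, shared_edge):
    """Count how many times each source vertex writes its output.

    edges is a list of (source, target) pairs. Without a shared edge the
    source writes one copy per consumer; with one (assumed behaviour) it
    writes a single copy that every consumer reads.
    """
    consumers = defaultdict(list)
    for src, dst in edges:
        consumers[src].append(dst)
    return {src: 1 if shared_edge else len(dsts) for src, dsts in consumers.items()}


edges = [("v1", "v2"), ("v1", "v3")]
writes_today = plan_writes(edges, shared_edge=False)   # one copy per consumer
writes_shared = plan_writes(edges, shared_edge=True)   # one copy in total
```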