I ran a synthetic benchmark (with very little input data) against the tez app, to understand the bare minimum time Tez takes for container launch, container reuse, scheduling, etc.
Profiling DAGAppMaster showed that a lot of CPU time was spent in VertexImpl.getTask(int), which is called as part of event handling and state transitions.
This problem would be more prevalent in large jobs that have lots of small tasks.
I will attach the perf SVG output of the DAG soon.
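
The profile alone does not say why getTask(int) is expensive. As a rough illustration of how a by-index lookup can come to dominate when there are many small tasks, here is a minimal standalone sketch. It is not Tez code: the class TaskLookupSketch, both helper methods, and the assumption that tasks live in an insertion-ordered map that is walked on every call are all hypothetical, chosen only to show the scaling difference against an index-addressable list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

public class TaskLookupSketch {

    // Hypothetical slow path: walk an insertion-ordered map until the
    // requested position is reached. Cost grows linearly with the index.
    static String getTaskByIteration(LinkedHashMap<String, String> tasks, int index) {
        int i = 0;
        for (String task : tasks.values()) {
            if (i++ == index) {
                return task;
            }
        }
        return null; // index out of range
    }

    // Hypothetical fast path: keep tasks in an index-addressable list.
    // Cost is constant per lookup.
    static String getTaskByIndex(List<String> tasks, int index) {
        return (index >= 0 && index < tasks.size()) ? tasks.get(index) : null;
    }

    public static void main(String[] args) {
        int numTasks = 5000; // assumed size, stands in for "lots of small tasks"
        LinkedHashMap<String, String> map = new LinkedHashMap<>();
        List<String> list = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            map.put("id_" + i, "task_" + i);
            list.add("task_" + i);
        }

        // Simulate one lookup per task, as an event-driven scheduler might do.
        long start = System.nanoTime();
        for (int i = 0; i < numTasks; i++) {
            getTaskByIteration(map, i);
        }
        long iterationNanos = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < numTasks; i++) {
            getTaskByIndex(list, i);
        }
        long indexNanos = System.nanoTime() - start;

        System.out.println("iteration lookup: " + iterationNanos / 1_000_000.0 + " ms");
        System.out.println("indexed lookup:   " + indexNanos / 1_000_000.0 + " ms");
    }
}
```

With one lookup per task, the iterating variant does work proportional to the square of the task count, while the indexed variant stays linear overall. Whether this matches the actual cause here should be confirmed from the perf output.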