Details
-
Sub-task
-
Status: Patch Available
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
Consider a series of joins or group by on dataset A with few datasets that takes 10 hours followed by a final join with a dataset X. The vertex that loads dataset X will be one of the top vertexes and initialized early even though its output is not consumed till the end after 10 hours.
1) Could either use delayed start logic for better resource allocation
2) Else if they are started upfront, need to handle failure/recovery cases where the nodes which executed the MapTask might have gone down when the final join happens.