Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Per performance testing done by sapinamin (thus, I'm setting him as reporter), we were able to discover an important bug affecting replication. It has the potential to affect other large DAGs of Tasks that hive generates as well, if those DAGs have multiple paths to child Task nodes.
Basically, we find that incremental REPL LOAD does not finish in a timely fashion. The test, in this case was to add 400 partitions, and replicate them. Associated with each partition, there was an ADD PTN and a ALTER PTN. For each of the ADD PTN tasks, we'd generate a DDLTask, a CopyTask and a MoveTask. For each Alter ptn, there'd be a single DDLTask. And order of execution is important, so it would chain in dependency collection tasks between phases.
Trying to root cause this shows us that it seems to stall forever at the Driver instantiation time, and it almost looks like the thread doesn't proceed past that point.
Looking at logs, it seems that the way this is written, it looks for all tasks generated that are subtrees of all nodes, without looking for duplicates, and this is done simply to get the number of execution tasks!
And thus, the task visitor will visit every subtree of every node, which is fine if you have graphs that look like open trees, but is horrible for us, since we have dependency collection tasks between each phase. Effectively, this is what's happening:
We have a DAG, say, like this:
4 tasks in parallel -> DEP col -> 4 tasks in parallel -> DEP col -> ...
This means that for each of the 4 root tasks, we will do a full traversal of every graph (not just every node) past the DEP col, and this happens recursively, and this leads to an exponential growth of number of tasks visited as the length and breadth of the graph increase. In our case, we had about 800 tasks in the graph, with roughly a width of about 2-3, with 200 stages, a dep collection before and after, and this meant that leaf nodes of this DAG would have something like 2^200 - 3^200 ways in which they can be visited, and thus, we'd visit them in all those ways. And all this simply to count the number of tasks to schedule - we would revisit this function multiple more times, once per each hook, once for the MapReduceCompiler and once for the TaskCompiler.
We have not been sending such large DAGs to the Driver, thus it has not yet been a problem, and there are upcoming changes to reduce the number of tasks replication generates(as part of a memory addressing issue), but we still should fix the way we do Task traversal so that a large DAG cannot cripple us.