Regarding finding the next task to run:
The problem is that we scan the array of TaskInProgress(TIP) objects (the 'tasks' array) from the beginning, each time we need to find a task to run and we haven't found one in the cache. When you have lots and lots of tasks, after some time the tasks array will predominantly be filled with running or completed tasks, especially at lower indexes. So it takes longer and longer to scan the array for virgin tasks (tasks that haven't run yet) or failed tasks if we always start from the beginning. There are a number of ways to get around this:
1. On a cache miss, if you don't mind returning a virgin task ahead of a failed task, even though the failed task occurs earlier in the array than the virgin task, then you can do the following. Keep an index (a pointer) into the tasks array, which points to the first task in the array that is a virgin task. Every TIP to the left of this index will have been run at least once. Every time we need to scan the array, we look for runnable tasks from this index onwards. This index will move across the length of the array just once during the running of all tasks, so it averages out to O(1) for finding the first virgin task. This is fairly easy to code, and should provide significant savings in the cases where there are lots and lots of tasks. If a virgin task is not found, you can use the same code to look for a failed task or a speculative task.
2. If you want to keep the same algorithm for finding the next task, i.e., if you want to return either a virgin or failed task first on a cache miss, then you can keep an index into the array which represents the first failed or virgin task. This approach won't be as fast as the first one, since this index can move back and forth across the array, but it should still be better over what we do today. In both options, the index will have to be potentially updated whenever a task's status changes.
A more detailed approach requires maintaining separate lists for tasks with different states: one for virgin tasks, one for failed, one for completed, and one for running. As a task's status changes, it moves from one list to another. These lists can be sorted (the list of virgin tasks should be sorted in decreasing order of input size; the other lists can be sorted the same way, or perhaps in order of when the tasks ran, which simplifies things). Picking the next task is as simple as walking down the list (in many cases, just picking the first element of a list). Care must be taken that moving an element from one list to another is as effective as possible. The array of TIPS is no longer needed. I can think of at least a couple of ways of doing this:
3. Modify a TIP object to have a reference to a 'previous' TIP object and a 'next' TIP object, so that we have linked lists of TIPS. Then a list is just a reference to the first (and last) TIP object, and a TIP belongs to just one list. If objects are added into the Running, Completed, and Failed lists at the tail only, handling a TIP's status change is O(1), but the lists are no longer sorted by size. Rather they are sorted by when the task ran. Finding the next task is pretty much constant, as we either pick the first TIP in the Virgin list, or the first TIP in the failed list (such that the TIP didn't fail on the host), or we walk through the Running list to find the first speculative task. If you need the lists sorted by size, then insertion is O. This is a fair bit of change, as far as coding is concerned.
4. Lists can also be implemented as arrays of TIPS. This takes more space but moving a TIP from one list to another is faster than in the linked-list case. Insertion into a list sorted by size can be O(log n).
You can also have hybrid approaches where you just keep lists for virgin tasks, for example.
I think it makes most sense to go with Option 1 for now, as it's the easiest to implement and makes the most common case run much faster. Options 3 and 4 need a fair bit of refactoring and may be an overkill for now, since you can get the most bang for the buck by just making sure that you don't scan the array from the beginning for virgin tasks.