Our cluster has been running slowly since I got speculation working, I looked into it and noticed that stderr was saying some tasks were taking almost an hour to run even though in the application logs on the nodes that task only took a minute or so to run. Digging into the thread dump for the master node I noticed a number of threads are blocked, apparently by speculation thread. At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while it looks through the tasks to see what needs to be rerun. Unfortunately that code loops through each of the tasks, so when you have even just a couple hundred thousand tasks to run that can be prohibitively slow to run inside of a synchronized block. Once I disabled speculation, the job went back to having acceptable performance.
There are no comments around that lock indicating why it was added, and the git history seems to have a couple refactorings so its hard to find where it was added. I'm tempted to believe it is the result of someone assuming that an extra synchronized block never hurt anyone (in reality I've probably just as many bugs caused by over synchronization as too little) as it looks too broad to be actually guarding any potential concurrency issue. But, since concurrency issues can be tricky to reproduce (and yes, I understand that's an extreme understatement) I'm not sure just blindly removing it without being familiar with the history is necessarily safe.
Can someone look into this? Or at least make a note in the documentation that speculation should not be used with large clusters?