The stress test contains logic that tries to guess a reasonable timeout for each query. There are enough fudge factors that the false positive rate is fairly low, but the check also provides little useful coverage unless a query is stuck, and an overall job timeout achieves the same thing.
Some specific issues with the current logic (which are tricky to solve):
- The number of concurrent queries is calculated at query submission time. E.g. a query that starts just before a large batch of other queries is submitted is given a small timeout multiplier, even though it ends up running alongside many more queries.
- There is no guarantee that performance degrades linearly. E.g. if runtime filters arrive late, we can see much larger perf hits.
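To make the first issue concrete, here is a minimal sketch of a submission-time timeout calculation. It is not the stress test's actual code: the function `timeout_secs`, the `fudge_factor` parameter, and all of the numbers are hypothetical, and the linear scaling is exactly the assumption the second issue calls into question.

```python
def timeout_secs(solo_runtime_secs, concurrent_queries, fudge_factor=2.0):
    """Hypothetical timeout: solo runtime scaled linearly by concurrency.

    This is an illustrative formula, not the one used by the stress test.
    """
    return solo_runtime_secs * max(concurrent_queries, 1) * fudge_factor


# A query with a 10 s solo runtime is submitted while only 2 queries are running,
# so its timeout is fixed using a multiplier of 2.
timeout_at_submission = timeout_secs(10, 2)

# Shortly afterwards a large batch arrives and 20 queries run concurrently.
# A timeout computed for the load the query actually experiences is far larger.
timeout_under_actual_load = timeout_secs(10, 20)

print(timeout_at_submission, timeout_under_actual_load)
```

In this sketch the query is held to a timeout ten times shorter than the one the same formula would give for the load it actually runs under, and that is before accounting for any non-linear slowdown.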
We should consider removing the timeout enforcement or at least revisit it.