When running a very large query on a cluster with limited resources, we noticed that the VM thread on one of the nodes freezes the fragment threads while it tries to do some work (GC, perhaps?). This is a clear indication that the query is stuck in a state from which it might not recover.
Under such circumstances, it makes sense to cancel the query, or at least warn the user on that page that the query has exceeded a certain threshold.
To detect this, the user can look at the Last Progress column in the Fragments Overview section, which will show large times.
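As a rough illustration of the kind of check that could drive such a warning, here is a minimal sketch. It assumes each fragment reports the time of its last progress update; the names `FragmentStatus`, `stalled_fragments`, and the 5-minute default threshold are hypothetical and not taken from the existing code.

```python
from dataclasses import dataclass


@dataclass
class FragmentStatus:
    # Hypothetical record; field names are assumptions for illustration.
    fragment_id: str
    last_progress_ms: int  # epoch millis of the most recent progress report


def stalled_fragments(fragments, now_ms, threshold_ms=300_000):
    """Return ids of fragments that have reported no progress for longer
    than threshold_ms (default of 5 minutes is an arbitrary assumption)."""
    return [
        f.fragment_id
        for f in fragments
        if now_ms - f.last_progress_ms > threshold_ms
    ]
```

The page could show a warning whenever this list is non-empty. The threshold would presumably need to be configurable, since an acceptable gap depends on the workload.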
In addition, there are instances where a query might have buffered operators spilling to disk, which also hurts performance (and, subsequently, leads to longer run times). Calling out this condition can be very useful.
Or there might be cases where a single fragment takes much longer than the average (indicated by an extreme skew in the Gantt chart).
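The skew case could be detected with something like the sketch below. It assumes per-fragment run times are available as a mapping; the function name `skewed_fragments` and the 2x ratio are hypothetical choices for illustration, not an existing API.

```python
from statistics import mean


def skewed_fragments(runtimes_ms, ratio=2.0):
    """Return ids of fragments whose run time exceeds ratio times the
    average run time across all fragments.

    runtimes_ms: dict mapping fragment id -> run time in milliseconds
    ratio: assumed cutoff; 2.0 is an arbitrary default for illustration
    """
    if not runtimes_ms:
        return []
    avg = mean(runtimes_ms.values())
    return sorted(fid for fid, t in runtimes_ms.items() if t > ratio * avg)
```

For example, with run times of 1000, 1200, 900, and 9000 ms, the average is 3025 ms, so only the 9000 ms fragment is flagged. Note that a single extreme outlier raises the average itself; comparing against the median instead would be one way to make the check more sensitive.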