The membership mechanism used to tell Impala servers about failures does not always detect fast crash-restarts. If a server restarts and re-registers before the state-store recognises that it has failed, the failure is never reported to any other subscriber.
The right way to fix this, I think, is to track a version number for every subscriber. Each time a subscriber reconnects, it gets a new, higher version number. For every query, we record the highest subscriber version number known when the query starts. If any backend executing the query later shows a version number higher than that, it has probably restarted since the query started. This might produce a few false positives, since a node could conceivably restart between being assigned work by the scheduler and actually receiving the query, but that's unlikely and better than false negatives.
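
To make the proposal concrete, here is a minimal Python sketch of the version check. It is not Impala code: the Membership and Query classes and all of their method names are hypothetical, and it assumes version numbers come from a single monotonically increasing counter, which the comparison above needs but the description does not spell out.

```python
import itertools


class Membership:
    """Hypothetical stand-in for the state-store's subscriber table."""

    def __init__(self):
        # Assumption: one global counter, so a later registration always
        # gets a strictly higher version than any earlier one.
        self._counter = itertools.count(1)
        self.versions = {}

    def register(self, subscriber_id):
        """Called on first connect and on every reconnect."""
        version = next(self._counter)
        self.versions[subscriber_id] = version
        return version

    def highest_version(self):
        return max(self.versions.values(), default=0)


class Query:
    """Hypothetical per-query record kept by the coordinator."""

    def __init__(self, membership, backends):
        self.backends = list(backends)
        # Highest version known at the time the query starts.
        self.start_version = membership.highest_version()

    def restarted_backends(self, membership):
        """Backends whose current version is newer than the query itself."""
        return [
            b for b in self.backends
            if membership.versions.get(b, 0) > self.start_version
        ]


membership = Membership()
membership.register("backend-a")
membership.register("backend-b")

query = Query(membership, ["backend-a", "backend-b"])
print(query.restarted_backends(membership))  # nothing has restarted yet

# backend-b crashes and re-registers before the failure is noticed.
membership.register("backend-b")
print(query.restarted_backends(membership))  # backend-b is now flagged
```

A backend that is missing from the table altogether is deliberately not flagged here, on the assumption that the existing failure-detection path already covers that case.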