I was able to reproduce this bug on a pre-IMPALA-4953 version of Impala with the attached patch, which adds a sleep. The patch is a hack and only works on my system (it has a hardcoded name). The trick is to manually kill the third impalad while the cluster is starting up.
The system then gets stuck in a state where all impalads think 22002 is alive even though the process was actually killed. Queries fail because they keep getting scheduled on the dead impalad.
The race seems quite exotic but may be possible if there are intermittent transport errors (causing heartbeats to fail) or if there are delays in processing topics, e.g. from lock contention.
IMPALA-4953 fixes the problem by deleting newly-added transient entries if the subscriber got unregistered while the statestore was processing an update.
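To make the race and the fix easier to follow, here is a minimal toy model in Python. It is not Impala's statestore code: the class `ToyStatestore`, its methods, the `on_delay` hook (standing in for the sleep in the patch), and the entry key `impalad-3:22002` are all made up for illustration. It only assumes that transient entries are removed when their owner is unregistered, and that an update already in flight can add entries after that removal has run.

```python
class ToyStatestore:
    """Toy model of the race; NOT Impala's actual statestore code."""

    def __init__(self, cleanup_after_update):
        self.subscribers = set()
        self.topic = {}  # entry key -> id of the subscriber that owns it
        # True models the fix described above; False models the old behavior.
        self.cleanup_after_update = cleanup_after_update

    def register(self, sub_id):
        self.subscribers.add(sub_id)

    def unregister(self, sub_id):
        # Drop the subscriber and the transient entries it owns so far.
        self.subscribers.discard(sub_id)
        self.topic = {k: o for k, o in self.topic.items() if o != sub_id}

    def process_update(self, sub_id, keys, on_delay=None):
        if sub_id not in self.subscribers:
            return
        # Hypothetical hook standing in for the sleep: anything that runs
        # here happens while the update is being processed.
        if on_delay is not None:
            on_delay()
        added = []
        for key in keys:
            self.topic[key] = sub_id
            added.append(key)
        # The fix: if the subscriber went away in the meantime, delete the
        # transient entries this update just added.
        if self.cleanup_after_update and sub_id not in self.subscribers:
            for key in added:
                if self.topic.get(key) == sub_id:
                    del self.topic[key]


def run(cleanup_after_update):
    store = ToyStatestore(cleanup_after_update)
    store.register("impalad-3")
    # The subscriber is unregistered in the middle of its own update.
    store.process_update(
        "impalad-3",
        ["impalad-3:22002"],
        on_delay=lambda: store.unregister("impalad-3"),
    )
    return sorted(store.topic)


if __name__ == "__main__":
    print("without cleanup:", run(False))
    print("with cleanup:   ", run(True))
```

Without the cleanup step, the entry for the dead subscriber stays in the topic because nothing is left to remove it, which matches the stuck state described above. With the cleanup step, the topic ends up empty.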