If an agent reports a build as failing, the build could be put back in the queue and tried again. We might want to add a header that records how many times it has been re-queued, so that builds which keep failing drop out on their own and the queue doesn't need to be flushed manually on a regular basis. A limit of two or three retries is probably good. Definitely no more than five.
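A minimal sketch of the re-queue counter, using a plain map to stand in for the message headers. The header name requeueCount, the class name, and the limit of 3 are assumptions for illustration only; nothing here is tied to a real queue API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class RequeuePolicy {
    // Hypothetical header name; the note does not specify one.
    static final String REQUEUE_COUNT = "requeueCount";
    // Assumed limit: the note suggests two or three, never more than five.
    static final int MAX_REQUEUES = 3;

    /**
     * Called when an agent reports a failed build.
     * Returns the headers for the re-queued message with the counter
     * incremented, or empty if the build has used up its retries.
     */
    static Optional<Map<String, Object>> onFailure(Map<String, Object> headers) {
        int count = (Integer) headers.getOrDefault(REQUEUE_COUNT, 0);
        if (count >= MAX_REQUEUES) {
            return Optional.empty(); // give up, do not re-queue
        }
        Map<String, Object> next = new HashMap<>(headers); // leave the input untouched
        next.put(REQUEUE_COUNT, count + 1);
        return Optional.of(next);
    }
}
```

With a real broker the counter would probably live in a message property rather than a map, but the decision logic would be the same.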
Maybe eventually we want to add something so that if a build does fail, the machine it ran on doesn't pick it back up. We'd need some JMS selectors and a message property for that. The point of the feature is that if a particular agent is badly configured and failing every build sent its way, it won't matter quite so much and the agent won't be a weak link in the system. The builds it couldn't run will go back into the queue and be tried again on another machine. That machine could fail at running the build too, so we should keep a list of every machine that has tried the build and not just remember the last one. The rule for giving up could be: stop once the max tries have been exceeded or once the build has gone around to every agent in the system.
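A rough sketch of how the tried-agents list and the stop rule might look. The property name triedAgents, the "|"-delimited encoding, and MAX_TRIES are all assumptions, and agent ids are assumed not to contain "|", "%", "_" or quote characters. The selector string uses standard JMS selector syntax (IS NULL, NOT LIKE); the IS NULL branch is there because a NOT LIKE test against a missing property does not match.

```java
import java.util.Set;

class TriedAgents {
    // Assumed limit on attempts across all agents.
    static final int MAX_TRIES = 3;

    /** Encodes ids as "|a|b|" so a LIKE pattern can match whole ids only. */
    static String encode(Set<String> agents) {
        StringBuilder sb = new StringBuilder("|");
        for (String agent : agents) {
            sb.append(agent).append('|');
        }
        return sb.toString();
    }

    /** Selector an agent would subscribe with so it skips builds it already failed. */
    static String selectorFor(String agentId) {
        return "triedAgents IS NULL OR triedAgents NOT LIKE '%|" + agentId + "|%'";
    }

    /** Give up once the try limit is reached or every known agent has had a go. */
    static boolean shouldGiveUp(Set<String> tried, Set<String> allAgents) {
        return tried.size() >= MAX_TRIES || tried.containsAll(allAgents);
    }
}
```

Because each agent is excluded after its first failure, the size of the tried set doubles as the attempt count here. If agents can come and go, the "all agents" set would need to be read at decision time.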
It might become apparent with enough failed attempts that the agents that fail have something in common (same OS, same Java version, same database, etc.), and we could use that data to find platform-specific issues. This would just be a "finger in the wind" compared to actually trying the same build task on each platform. But since we need the list of failed agents anyway, it's a fairly cheap finger in the wind.
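One cheap way to look for a common factor, sketched under the assumption that each agent can be described by a map of attributes (the keys "os", "java" and so on are made up for illustration): keep only the attribute values that every failing agent shares.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FailurePattern {
    /**
     * Returns the attribute/value pairs common to every failing agent.
     * An empty result means the failures share nothing obvious.
     */
    static Map<String, String> sharedAttributes(List<Map<String, String>> failedAgents) {
        if (failedAgents.isEmpty()) {
            return new HashMap<>();
        }
        // Start from the first agent and drop anything a later agent disagrees on.
        Map<String, String> shared = new HashMap<>(failedAgents.get(0));
        for (Map<String, String> agent : failedAgents) {
            shared.entrySet().removeIf(e -> !e.getValue().equals(agent.get(e.getKey())));
        }
        return shared;
    }
}
```

With a single failed agent every attribute counts as "shared", so the result only starts to mean something after two or more failures, and even then it is a hint rather than proof.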