While performing scale testing, detailed profiling of Go test clients showed that >95% of the execution time can be spent in cgo calls. On sends, the issue seems to be related to the NewMessage() call. On receives, the bottleneck is both NewMessage() and the call that actually receives the message.
This behavior is not unexpected, as cgo is a well-known bottleneck. Would it be possible to have a NewMessage() call that returns multiple messages, and a recv call that takes an "at most" argument? For example, recv(10) would receive 10 or fewer messages that might be waiting in the queue. Also, it would be nice to be able to trade latency for throughput, so that the callback isn't triggered until N messages have been received (with a timeout).