Jim, I don't believe this issue should be closed. It may not have anything to do with our code, but it affects us in a very significant way so we need to get to the bottom of it. This is an exploratory issue, hopefully with a solution, but we're not done with it yet.
Your conclusion is not correct. You are writing off the delay as context switching that occurs when the client is on the same machine. First of all, those context switches are orders of magnitude cheaper than the timings of these queries. The queries in question run about 40 times slower (something around 4000 ms of extra delay; we don't yet know where it comes from) when running local to the hosting node. That amount of time is clearly not explainable by the additional context switching of having these two processes running concurrently.
But more importantly, this explanation does not match what we're seeing. If context switching were the cause, ALL queries should run slower by some roughly fixed factor when running on the same node. But they don't. There is a very specific, definable range of payload sizes for which this extra delay of ~4 seconds exists. The 7-column case and the 1000-column case both perform nearly identically in both setups, so the effect of the context switching is negligible.
Have you done network-level debugging? We need to figure out where in the chain the delay is introduced and go from there.
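As a first step before packet captures, it might help to bracket exactly which payload sizes trigger the delay. The sketch below is hypothetical and uses a stand-in echo server on loopback rather than our actual client/server (the sizes are placeholders, not our real column-count payloads); against the real setup you'd replace the echo stub with the queries themselves and sweep sizes until the ~4 s cliff appears:

```python
import socket
import threading
import time

# Stand-in echo server: in the real investigation this would be the
# actual hosting node, and the payloads would be the real queries.
def echo_server(srv):
    conn, _ = srv.accept()
    with conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            conn.sendall(data)

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

cli = socket.create_connection(("127.0.0.1", port))
results = {}
for size in (7, 100, 1000, 10000):  # placeholder payload sizes
    payload = b"x" * size
    start = time.perf_counter()
    cli.sendall(payload)
    received = 0
    while received < size:  # read the echo back in full
        received += len(cli.recv(65536))
    results[size] = (time.perf_counter() - start) * 1000  # ms round trip
cli.close()

for size, ms in sorted(results.items()):
    print(f"{size:6d} bytes: {ms:.3f} ms")
```

If the cliff shows up in a sweep like this even with a dumb echo server, that points at the transport layer (kernel, socket buffering) rather than our code; if it only shows up against the real server, the RPC layer becomes the prime suspect.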
The issue could be in Linux, in the RPC layer, who knows... but we should keep digging, whether or not we figure this out in time for 0.20.