|
I think this is, in general, something that we won't fix. It might be possible to improve things, but we cannot, without elaborate handshake protocols, guarantee that RPC responses are received by clients. As Owen has indicated, we must instead make our applications tolerant of that.
Note that this is generally a problem for HTTP-based services too. When someone hits "stop" in their browser for a slow request, the server generally continues to compute the request and only discovers that the connection has been closed when it attempts to write the response, if at all. If the client times out after the server has flushed the response, then there's no way for the server to know this. You sugguest that we might queue requests on the client so that only a single request to a particular server is outstanding at a time. That would not work well for distributed search (Nutch's original IPC application). In distributed search a front end typically has many queries outstanding. Each query is broadcast to a number of back end servers. Different queries take different amounts of time. We do not want to make fast queries wait for slow queries, as that would make all queries slow and increase the burden on the front end servers. Would folks object if I resolve this as a WONTFIX bug? Well the problem I was seeing is that a getFile RPC request for say 1GB was issued, but then the Call object timedout on the client. Yet the 1GB was still transferred fully to the client and then discarded since there was no waiting Call object. My system got into a situation where 1000s of getFile requests were queued up per node in the server's tcp receive buffers. So you can see how no progress would have ever been made.
Are you suggesting that this is not going to be a problem because all the large response body RPCs will now be done over HTTP? I can't see how leaving the code as it is would be fine, unless all RPCs are going to be very quick to service and have small response bodies. You mentioned that search queries is an example of an RPC that could be broadcast out. What would happen if the queries were taking too long to service and the client side rpc request was already timing out. I could see it get into a similar situation where the servers would become busy processing stale query requests. So if all the RPCs will be small response bodies, then it seems fine to keep the connection always open and just read in the response and throw it away. And then why not add a CANCEL-RPC request type that can get sent over whenever the client request has timed out? Doug, We realize all of this will change with all the great copy/sort-path work being done at Yahoo, but here's what we're seeing: When our individual map output files get large, like >1GB, we get endless cascading timeouts. Naveen found the actual cause of this:
So what happens is, you get an endless cycle of timed-out requests, that are eventually satisfied, causing useless transfers of >1GB, which cause more timeouts, etc. Obviously, the new efforts at Yahoo will change this considerably, for example not multiplexing these transfers within a single TCP session will allow termination of the TCP session on a timeout, which will prevent large, useless transfers from happening. But I just wanted to explain what we were seeing, and why it is definitely uncool to proceed with large transfers when nobody will use the results. We'll try out the new Yahoo code, and see what we get. I suspect that this will help a lot. I'm going to hijack this bug. Clearly the original context was fixed by moving from the rpc getMapOutput to a jetty servlet. However, we are seeing cases where the dfs servers have trouble keeping up with the rpc calls.
Therefore, I propose that we define a fraction of the ipc.timeout that is the maximum time the rpc calls can take before they are given to the handler. This patch has the rpc server handlers discard any call that is older than 60% of the ipc.timeout.
I worry about letting the queued calls grow without bound. If a synchronized server implementation spends a long time on a single request, while lots of other requests are coming in from other clients, then the queue could end up exhausting memory. So perhaps we should discard stale requests as new requests are queued rather than when they're dequeued, so that we can limit the size of the queue?
+1 for Dougs comment
The thread populating the call queue should add calls to the end of the queue. Then examine calls at the front for staleness and discard calls that are too old. This patch adds a fixed size limit to the number of calls in the rpc server's queue. If the queue is full, the oldest is discarded to make room for the new one.
I just committed this. Thanks, Owen!
This was included in the 0.7.0 release, but mistakenly marked for 0.8.0.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The second part is that this behavior happens on all of the rpc's, not just map output transfer. However, sending a message now to potentially prevent a future message is not necessarily a winning game. If you did send a "cancel call" message, you'd probably want to remove the call from the server's work queue if it is still waiting to be processed.
It is tempting to try a strategy where you check the age of a call when you start processing it on the server and reject messages that are too old, but the problem is that it is not 10 seconds from the start of the call, but rather 10 seconds with no data received on the socket, which is hard for the server to estimate.
The final part is that this characteristic that the server can not assume that the return message from the rpc call was received is a problem. For example, I had a problem with pollForNewTask timing out and dropping tasks. I fixed that by adding a timeout so that after a task is assigned to a task tracker, it it does not show up in a task tracker status message within 10 minutes it is considered lost. However, this applies to all of the rpc messages. You always need to make sure that if the return value of the rpc call were to disappear into thin air that the problem would be detected eventually. There are other instances of this ind of problem that still exist in the code that need to be identified and fixed.