I stumbled on this behaviour by accident. If you issue a remote query and call close() on the QueryEngineHTTP before consuming all the results, your application can hang until all the data has been read from the response stream.
This behaviour is caused by Apache HTTP Client, which assumes it can re-use connections. To do so it first needs to consume the previous response, so it sits in a tight loop until it has done this. Note that this won't always happen, because HTTP Client inspects various aspects of the response to decide whether it can re-use the connection. However, unless certain conditions are met, HTTP Client defaults to the connection re-use behaviour.
This is obviously bad for the user: if they've told us to close the execution then clearly they want us to dispose of it and carry on ASAP.
It also causes issues for the server. Rather than dropping the connection, HTTP Client continues to read from the server, so the server may also be stuck in a semi-hung state doing a lot of work that the actual user is never going to see.
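A minimal sketch of the drain-on-close behaviour described above, using only the JDK. This is a simulation, not HTTP Client or Jena code: `DrainOnCloseDemo`, `CountingStream` and `DrainOnCloseStream` are hypothetical names, and an in-memory byte array stands in for the server's response.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-ins for a pooled connection's response stream (assumption, not library code).
class DrainOnCloseDemo {

    /** Counts how many bytes have been pulled from the wrapped stream. */
    static class CountingStream extends InputStream {
        private final InputStream in;
        long bytesRead = 0;

        CountingStream(InputStream in) { this.in = in; }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b >= 0) bytesRead++;
            return b;
        }
    }

    /** On close(), reads to end of stream so the "connection" could be re-used. */
    static class DrainOnCloseStream extends InputStream {
        private final InputStream in;

        DrainOnCloseStream(InputStream in) { this.in = in; }

        @Override
        public int read() throws IOException { return in.read(); }

        @Override
        public void close() throws IOException {
            // The caller is stuck in this loop for as long as the sender keeps producing data.
            while (in.read() >= 0) { }
            in.close();
        }
    }

    /** Reads 'consumed' bytes of a 'total'-byte response, closes it, and returns the bytes pulled overall. */
    static long bytesPulledAfterEarlyClose(int total, int consumed) {
        try {
            CountingStream wire = new CountingStream(new ByteArrayInputStream(new byte[total]));
            InputStream response = new DrainOnCloseStream(wire);
            for (int i = 0; i < consumed; i++) response.read();
            response.close();
            return wire.bytesRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The caller only wanted 10 bytes, but close() pulled the whole response.
        System.out.println(bytesPulledAfterEarlyClose(1_000_000, 10)); // prints 1000000
    }
}
```

With a real remote query, the bytes in the drain loop arrive only as fast as the server can compute them, which is why close() appears to hang.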
Steps to reproduce:
- Start up Fuseki
- Run a simple Jena app that creates a query that will take a long time (e.g. a large cross product), issues it to Fuseki, then calls close() on the QueryExecution
You should observe that the Jena app hangs until Fuseki reports the query as completed. If you log the current time before and after calling close() you should see a large delay (assuming a sufficiently long running query).
There are several possible solutions that come to mind:
1. Upgrade to newer HTTP Client and hope it does not have the behaviour
2. Disable connection re-use when providing our own HTTP client
3. When we know we will shut down the client (and thus re-use is irrelevant) terminate the client first rather than closing the connection first
Option 1 is likely to be problematic because APIs have changed significantly and there are dependency conflicts with other modules such as jena-text. Also, I do not expect that newer versions will have changed their behaviour in this regard, so it would be ineffective anyway.
Option 2 is intrusive but effective.
Option 3 may actually be the best choice because it does not require us to explicitly configure a connection re-use strategy. Instead it lets us simply kill off the client, which we were potentially going to do anyway (unless the user customised the HTTP Client being used). That kills the connections without first having to consume the response. Killing the connection should also abort the work on the server side, because the server should notice the dropped connection and stop trying to calculate and send further results.
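The difference the ordering makes in option 3 can be sketched with a small JDK-only simulation. Everything here is hypothetical: `ShutdownOrderDemo` and `FakeConnection` are illustrative names, `shutdown()` stands in for terminating the client, and the `produced` counter stands in for work done by the server. It is not the actual Jena or HTTP Client code path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical simulation of the two shutdown orders (assumption, not library code).
class ShutdownOrderDemo {

    /** Fake connection: the "server" produces 'total' bytes unless the connection is dropped. */
    static class FakeConnection extends InputStream {
        private final long total;
        long produced = 0;        // work the "server" has done so far
        boolean dropped = false;

        FakeConnection(long total) { this.total = total; }

        @Override
        public int read() throws IOException {
            if (dropped) throw new IOException("connection dropped");
            if (produced >= total) return -1;
            produced++;
            return 0;
        }

        /** Kills the connection without reading anything further. */
        void shutdown() { dropped = true; }
    }

    /** Returns a connection from which only 'consumed' bytes have been read. */
    static FakeConnection partlyRead(long total, int consumed) {
        FakeConnection conn = new FakeConnection(total);
        try {
            for (int i = 0; i < consumed; i++) conn.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return conn;
    }

    /** Current order: close the response first (which drains it), then shut the client down. */
    static long closeThenShutdown(FakeConnection conn) {
        try {
            while (conn.read() >= 0) { } // drain for a re-use that will never happen
        } catch (IOException e) {
            // not reached in this order: the connection is still alive while draining
        }
        conn.shutdown();
        return conn.produced;
    }

    /** Proposed order: shut the client down first, so the later close has nothing to drain. */
    static long shutdownThenClose(FakeConnection conn) {
        conn.shutdown();
        try {
            while (conn.read() >= 0) { }
        } catch (IOException e) {
            // expected: the connection is already gone, so the drain ends immediately
        }
        return conn.produced;
    }

    public static void main(String[] args) {
        System.out.println(closeThenShutdown(partlyRead(1_000_000, 10))); // prints 1000000
        System.out.println(shutdownThenClose(partlyRead(1_000_000, 10))); // prints 10
    }
}
```

In the second order the server side stops at the 10 bytes the caller actually asked for, which mirrors the expectation that a dropped connection lets the server abandon the remaining work.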