Description
When using the protocol-httpclient plugin, the entire content at the requested URL is retrieved, regardless of the http.content.limit configuration setting. (The issue does not affect the protocol-http plugin.)
For very large documents, this leads the Fetcher to believe that the FetcherThread is hung, and the Fetcher aborts its run, logging a warning about hung threads (Fetcher.java:433).
org.apache.nutch.protocol.httpclient.HttpResponse is properly counting the content length, and is breaking its read loop at the proper point.
However, when HttpResponse closes the InputStream from which it is reading, the InputStream object (an org.apache.commons.httpclient.AutoCloseInputStream) continues to read all of the content of the document from the webserver.
Though I'm not certain this is the correct solution, a quick test shows that if HttpResponse is changed to abort the GET, the InputStream correctly aborts the read from the webserver, and the FetcherThread can continue.
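To illustrate the behaviour described above, here is a minimal, self-contained simulation. It does not use the commons-httpclient API: `DrainOnCloseStream`, `abort()`, `bytesPulled()` and `fetch()` are hypothetical stand-ins, and the drain-on-close behaviour is modelled from the description in this report. In the real plugin the abort would presumably be issued on the GET method object before the stream is closed; that mapping is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class LimitedFetchSketch {

    /** Hypothetical stand-in for a response stream that drains the rest of the body on close(). */
    static class DrainOnCloseStream extends InputStream {
        private final long total;        // simulated document size in bytes
        private long consumed = 0;       // bytes pulled from the simulated server so far
        private boolean aborted = false; // set once the request has been aborted

        DrainOnCloseStream(long total) { this.total = total; }

        @Override
        public int read() {
            if (aborted || consumed >= total) return -1;
            consumed++;
            return 'x';
        }

        /** Simulates aborting the request: nothing more is read afterwards. */
        void abort() { aborted = true; }

        @Override
        public void close() {
            // Models the reported problem: close() reads whatever is left of the body.
            while (read() != -1) { /* discard */ }
        }

        long bytesPulled() { return consumed; }
    }

    /** Reads at most contentLimit bytes, then closes; returns total bytes pulled from the server. */
    static long fetch(long docSize, int contentLimit, boolean abortBeforeClose) {
        DrainOnCloseStream in = new DrainOnCloseStream(docSize);
        byte[] buf = new byte[4096];
        int stored = 0;
        try {
            // Read loop that stops at the content limit, as HttpResponse already does.
            while (stored < contentLimit) {
                int n = in.read(buf, 0, Math.min(buf.length, contentLimit - stored));
                if (n == -1) break;
                stored += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (abortBeforeClose) in.abort(); // the proposed change: abort before closing
        in.close();
        return in.bytesPulled();
    }

    public static void main(String[] args) {
        System.out.println("close only:   " + fetch(1_000_000, 65536, false)); // prints 1000000
        System.out.println("abort, close: " + fetch(1_000_000, 65536, true));  // prints 65536
    }
}
```

In this sketch the read loop honours the limit in both cases; only the call to `abort()` before `close()` keeps the remaining bytes from being pulled.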
Attachments
Issue Links
- relates to NUTCH-559 NTLM, Basic and Digest Authentication schemes for web/proxy server (Closed)