[NUTCH-481] http.content.limit is broken in the protocol-httpclient plugin - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: 1.0.0
Component/s: fetcher
Labels: None

Description

When using the protocol-httpclient plugin, the entire content of the requested URL is retrieved, regardless of the http.content.limit configuration setting. (The issue does not affect the protocol-http plugin.)

For very large documents, this leads the Fetcher to believe that the FetcherThread is hung, and the Fetcher aborts its run, logging a warning about hung threads (Fetcher.java:433).

org.apache.nutch.protocol.httpclient.HttpResponse counts the content length correctly and breaks out of its read loop at the proper point.

However, when HttpResponse closes the InputStream from which it is reading, the InputStream object (an org.apache.commons.httpclient.AutoCloseInputStream) continues to read all of the content of the document from the webserver.

Though I'm not certain this is the correct solution, a quick test shows that if HttpResponse is changed to abort the GET, the InputStream correctly aborts the read from the webserver, and the FetcherThread can continue.
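
The behavior described above can be illustrated with a small, self-contained simulation. This is only a sketch and not the attached patch: `ContentLimitSketch`, `SimulatedResponseStream`, and `fetch` are hypothetical names, and the stream is a stand-in that mimics what the report observes (closing the stream pulls the rest of the document from the server unless the request was aborted first). It does not use the Commons HttpClient classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustration only; none of these names exist in Nutch or Commons HttpClient. */
final class ContentLimitSketch {

    /** Stand-in for a response stream that drains unread content on close(). */
    static final class SimulatedResponseStream extends InputStream {
        private final InputStream wire;     // plays the role of the webserver
        private long pulledFromServer = 0;  // bytes actually transferred
        private boolean aborted = false;

        SimulatedResponseStream(int documentSize) {
            this.wire = new ByteArrayInputStream(new byte[documentSize]);
        }

        @Override
        public int read() throws IOException {
            if (aborted) return -1;
            int b = wire.read();
            if (b >= 0) pulledFromServer++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (aborted) return -1;
            int n = wire.read(buf, off, len);
            if (n > 0) pulledFromServer += n;
            return n;
        }

        /** Assumed analogue of aborting the GET: no further bytes are transferred. */
        void abort() {
            aborted = true;
        }

        /** Mimics the reported problem: close() reads the remaining content first. */
        @Override
        public void close() throws IOException {
            byte[] scratch = new byte[4096];
            while (read(scratch, 0, scratch.length) > 0) {
                // discard the remainder of the document
            }
        }

        long pulledFromServer() {
            return pulledFromServer;
        }
    }

    /** Returns {bytes kept by the caller, bytes pulled from the simulated server}. */
    static long[] fetch(int documentSize, int contentLimit, boolean abortAtLimit) {
        SimulatedResponseStream in = new SimulatedResponseStream(documentSize);
        byte[] buf = new byte[4096];
        int kept = 0;
        try {
            // The read loop stops at the limit, as the report says HttpResponse does.
            while (kept < contentLimit) {
                int n = in.read(buf, 0, Math.min(buf.length, contentLimit - kept));
                if (n < 0) break;  // document ended before the limit
                kept += n;
            }
            // Proposed change: abort before closing once the limit has been reached.
            if (abortAtLimit && kept >= contentLimit) {
                in.abort();
            }
            in.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] { kept, in.pulledFromServer() };
    }

    public static void main(String[] args) {
        long[] plainClose = fetch(1_000_000, 65_536, false);
        long[] withAbort = fetch(1_000_000, 65_536, true);
        System.out.println("close only: kept=" + plainClose[0] + " pulled=" + plainClose[1]);
        System.out.println("abort first: kept=" + withAbort[0] + " pulled=" + withAbort[1]);
    }
}
```

In the simulation, the caller keeps the same number of bytes in both cases; only the amount transferred differs. Whether aborting has other side effects in the real plugin (for example on connection reuse) is not something this sketch can show.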

Attachments

abortatcontentlimit.patch (0.8 kB), attached by charlie wanek, 11/May/07 18:39

Issue Links

relates to

NUTCH-559 NTLM, Basic and Digest Authentication schemes for web/proxy server

Closed

People

Assignee: Dogacan Guney

Reporter: charlie wanek

Votes: 0

Watchers: 2

Dates

Created: 11/May/07 18:23

Updated: 10/Apr/09 12:29

Resolved: 04/Jan/08 19:51