[NUTCH-2549] protocol-http does not behave the same as browsers - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.14
Fix Version/s: 1.15
Component/s: None
Labels: None
Flags: Important

Description

We identified the following issues in protocol-http (a plugin implementing the HTTP protocol):

It fails if a URL's path does not start with '/'.
- Example: http://news.fx678.com?171 (browsers correctly rewrite the URL as http://news.fx678.com/?171, while Nutch tries to send an invalid HTTP request starting with GET ?171 HTTP/1.0).
It advertises its requests as HTTP/1.0 but sends an Accept-Encoding request header, which is defined only in HTTP/1.1. This confuses some web servers.
- Example: http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm
If a server sends a redirection (3XX status code with a Location header), protocol-http tries to parse the HTTP response body anyway. Thus, if an error occurs while decoding the body, the redirection is not followed and the information is lost. Browsers follow the redirection and close the socket as soon as they can.
- Example: http://www.webarcelona.net/es/blog?page=2
Some servers invalidly send an HTTP body directly, without a status line or headers. Browsers handle that; protocol-http doesn't.
- Example: https://app.unitymedia.de/
Some servers invalidly add colons after the HTTP status code in the status line (they can send HTTP/1.1 404: Not found instead of HTTP/1.1 404 Not found for instance). Browsers can handle that.
Some servers invalidly send headers that span multiple lines. In that case, browsers simply ignore the subsequent lines, but protocol-http throws an error, which prevents us from fetching the contents of the page.
There is no limit on the size of the HTTP headers it reads. A bogus server could send an infinite stream of different HTTP headers and cause the fetcher to run out of memory, or send the same HTTP header repeatedly and cause the fetcher to time out.
The same goes for the HTTP status line: no check is made concerning its size.
While reading chunked content, if the content size becomes larger than http.getMaxContent(), instead of just stopping, it tries to read a new chunk before having read the previous one completely, resulting in a 'bad chunk length' error.
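
Two of the items above (the missing leading '/' in the request path, and the colon after the status code) can be illustrated with a small sketch. This is not the code from the attached patch; the class name, method names, and the MAX_STATUS_LINE_LENGTH value are hypothetical and only show one lenient way to handle these cases.

```java
import java.net.URI;

// Hypothetical sketch, not the actual protocol-http code.
class LenientHttpSketch {

    // Assumed cap on the status line length; not a real Nutch setting.
    static final int MAX_STATUS_LINE_LENGTH = 8192;

    // Builds the request target for the request line, inserting the
    // leading '/' when the URL has an empty path (as browsers do).
    static String requestTarget(URI uri) {
        String path = uri.getRawPath();
        if (path == null || !path.startsWith("/")) {
            path = "/" + (path == null ? "" : path);
        }
        String query = uri.getRawQuery();
        return query == null ? path : path + "?" + query;
    }

    // Extracts the status code from a status line, tolerating a colon
    // right after the code. Returns -1 when the line is too long or does
    // not look like a status line, so the caller can decide how to treat
    // a response that starts directly with the body.
    static int parseStatusCode(String line) {
        if (line.length() > MAX_STATUS_LINE_LENGTH || !line.startsWith("HTTP/")) {
            return -1;
        }
        String[] parts = line.trim().split("\\s+", 3);
        if (parts.length < 2) {
            return -1;
        }
        String code = parts[1];
        if (code.endsWith(":")) {
            code = code.substring(0, code.length() - 1);
        }
        try {
            return Integer.parseInt(code);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

With this sketch, requestTarget(URI.create("http://news.fx678.com?171")) yields /?171, and both HTTP/1.1 404: Not found and HTTP/1.1 404 Not found parse to the status code 404.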

Additionally (and this concerns protocol-httpclient as well), when reading HTTP headers, the SpellCheckedMetadata class computes, for each header, a Levenshtein distance between it and every known header in the HttpHeaders interface. Not only is that slow, non-standard, and inconsistent with browser behavior, but it also causes bugs and prevents us from accessing the real headers sent by the HTTP server.

Example: http://www.taz.de/!443358/ . The server sends a Client-Transfer-Encoding: chunked header, but SpellCheckedMetadata corrects it to Transfer-Encoding: chunked. Then, HttpResponse (in protocol-http) tries to read the HTTP body as chunked, whereas it is not.
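
As a point of comparison, header names can be matched exactly (ignoring case only) rather than by edit distance. The sketch below is an illustration under that assumption, not the fix that was committed; ExactHeaderLookupSketch, parseHeaders, and isChunked are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: exact, case-insensitive header lookup with no
// spelling correction.
class ExactHeaderLookupSketch {

    // Parses "Name: value" lines into a map whose keys compare
    // case-insensitively. Lines without a header name are skipped.
    static Map<String, String> parseHeaders(String... lines) {
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // malformed or continuation line: ignore it
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }

    // The body is treated as chunked only if a header named exactly
    // Transfer-Encoding (in any letter case) says so.
    static boolean isChunked(Map<String, String> headers) {
        return "chunked".equalsIgnoreCase(headers.get("Transfer-Encoding"));
    }
}
```

Here a response carrying only Client-Transfer-Encoding: chunked is not treated as chunked, while transfer-encoding: chunked still is.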

Attachments

NUTCH-2549.patch
24/May/18 12:28
9 kB
Gerard Bouchar

Issue Links

links to

GitHub Pull Request #347

Sub-Tasks

1.	URL normalization problem: path not starting with a '/'	Closed	Unassigned
2.	protocol-http makes invalid HTTP/1.0 requests	Closed	Unassigned
3.	protocol-http fails to follow redirections when an HTTP response body is invalid	Closed	Unassigned
4.	protocol-http cannot handle a missing HTTP status line	Closed	Unassigned
5.	protocol-http cannot handle colons after the HTTP status code	Closed	Unassigned
6.	protocol-http throws an error when an http header spans over multiple lines	Closed	Unassigned
7.	protocol-http can be made to read arbitrarily large HTTP responses	Closed	Unassigned
8.	protocol-http fails to read large chunked HTTP responses	Closed	Unassigned
9.	HTTP header spellchecking issues	Closed	Unassigned
10.	protocol-http throws an error when the content-length header is not a number	Closed	Unassigned
11.	protocol-http does not respect the maximum content-size for chunked responses	Closed	Unassigned

People

Assignee: Sebastian Nagel

Reporter: Gerard Bouchar

Votes: 0

Watchers: 4

Dates

Created: 06/Apr/18 14:39

Updated: 01/Oct/19 14:29

Resolved: 12/Jun/18 19:15