Description
This has been investigated and fixed in Storm-Crawler: https://github.com/DigitalPebble/storm-crawler/issues/48.
curl -I "http://www.dailynewslosangeles.com/"
HTTP/1.1 301 Moved Permanently
Location: http://www.dailynews.com
Connection: close
Content-Length: 0
Content-Type: text/html; charset=UTF-8
When fetching the same URL with Nutch, we get a timeout exception:
./nutch parsechecker -D http.agent.name="PebbleCrawler" "http://www.dailynewslosangeles.com/"
fetching: http://www.dailynewslosangeles.com/
Fetch failed with protocol status: exception(16), lastModified=0: java.net.SocketTimeoutException: Read timed out
The cause is that we try to read from the response stream even though the Content-Length header already tells us the body is empty (Content-Length: 0), so the read blocks until the socket times out.
The attached patch fixes the issue.
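
For illustration only, the idea behind the fix can be sketched as below. This is not the attached patch and not Nutch's actual code: the class name ContentLengthSketch and the method readBody are hypothetical, and the sketch assumes the Content-Length header value has already been extracted as a string. It simply returns an empty body without touching the stream when the declared length is 0.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Nutch: shows skipping the body read
// when the server has declared Content-Length: 0.
class ContentLengthSketch {

    static byte[] readBody(InputStream in, String contentLengthHeader) throws IOException {
        long contentLength = -1; // -1 means "unknown length"
        if (contentLengthHeader != null) {
            try {
                contentLength = Long.parseLong(contentLengthHeader.trim());
            } catch (NumberFormatException e) {
                // Malformed header: fall back to "unknown length".
            }
        }

        // Key point: nothing to read, so never call in.read().
        // Reading here is what would block until the socket timeout.
        if (contentLength == 0) {
            return new byte[0];
        }

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        long remaining = contentLength;
        // Read until EOF (unknown length) or until the declared length is consumed.
        while (contentLength < 0 || remaining > 0) {
            int max = contentLength < 0 ? buf.length : (int) Math.min(buf.length, remaining);
            int n = in.read(buf, 0, max);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

With a response like the 301 above, readBody(stream, "0") would return an empty array immediately, whereas an unconditional read on that stream is the step that ends in SocketTimeoutException.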