Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
1.0.0
None
None
Environment: Nutch 1.0.0, Windows XP, Java 1.6.0_17
Description
If there is a Modified time stored in the crawldb for a link, the class org.apache.nutch.protocol.http.HttpResponse will use it as the value for the If-Modified-Since header.
Line 131:
reqStr.append("\r\n");
if (datum.getModifiedTime() > 0) {
  reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
  reqStr.append("\r\n");
}
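The problem is that an extra blank line is inserted before this header. Because a blank line terminates the HTTP header section, the server never sees If-Modified-Since as a header, which makes it useless: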
----------------------------------------------------------------------------------
GET /tinysite/second.html HTTP/1.0
Host: localhost:8080
Accept-Encoding: x-gzip, gzip, deflate
User-Agent: nutch/Nutch-1.0
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3

If-Modified-Since: Tue, 27 Apr 2010 13:51:50 GMT
----------------------------------------------------------------------------------
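To illustrate why the misplaced blank line matters, here is a minimal sketch of how a recipient typically reads the header section: it stops at the first empty line. The class and method names (HeaderSectionSketch, headerLines) are hypothetical and are not part of Nutch or of any particular server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not Nutch code).
class HeaderSectionSketch {
    // Returns the header lines a recipient would see: everything after the
    // request line and before the first blank line.
    static List<String> headerLines(String rawRequest) {
        List<String> headers = new ArrayList<String>();
        String[] lines = rawRequest.split("\r\n", -1);
        for (int i = 1; i < lines.length; i++) { // index 0 is the request line
            if (lines[i].isEmpty()) {
                break; // a blank line ends the header section
            }
            headers.add(lines[i]);
        }
        return headers;
    }
}
```

With a request shaped like the one above, anything after the blank line, including If-Modified-Since, is not returned as a header.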
I'm using the AdaptiveFetchSchedule to set the Modified time in the crawldb.
I tested moving line 131 to after the if block, and it works. I think that is where the line belongs.
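The proposed ordering can be sketched as a standalone example. This is not the actual HttpResponse code: the class ConditionalGetSketch, the method buildRequest, and its parameters are hypothetical, and java.time's RFC_1123_DATE_TIME formatter stands in for Nutch's HttpDateFormat. The only point is that the terminating "\r\n" is appended after the optional If-Modified-Since header.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for the request-building part of HttpResponse.
class ConditionalGetSketch {
    static String buildRequest(String path, String host, long modifiedTime) {
        StringBuilder reqStr = new StringBuilder();
        reqStr.append("GET ").append(path).append(" HTTP/1.0\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");

        // Optional conditional header, written while the header section is still open.
        if (modifiedTime > 0) {
            String date = DateTimeFormatter.RFC_1123_DATE_TIME
                .format(Instant.ofEpochMilli(modifiedTime).atOffset(ZoneOffset.UTC));
            reqStr.append("If-Modified-Since: ").append(date).append("\r\n");
        }

        // The blank line that ends the header section comes last.
        reqStr.append("\r\n");
        return reqStr.toString();
    }
}
```

Calling buildRequest("/tinysite/second.html", "localhost:8080", 0) yields a request without the conditional header, still terminated by a single blank line.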