Details
Improvement
Status: Closed
Major
Resolution: Invalid
0.6, 0.7, 0.7.1, 0.8
None
None
Nutch: Windows XP, J2SE 1.4.2_09
Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
Description
1. Opening a TCP connection is expensive, not only for Nutch and the end-point web servers but also for intermediary network equipment.
2. The web server creates a client thread and expects Nutch to really use HTTP/1.1 (persistent connections), or at least to send "Connection: close" before calling "Socket.close()" in the JVM ...
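To make point 2 concrete, here is a minimal sketch of the request text a client could send over a reused HTTP/1.1 connection. It is not taken from the attached plugin; the class name `KeepAliveRequest` and its `build` method are hypothetical. It relies only on the HTTP/1.1 rule that connections are persistent by default and that "Connection: close" announces the sender's intent to close after the response.

```java
/** Hypothetical helper, for illustration only; not part of Nutch or HTTPClient. */
final class KeepAliveRequest {
    private KeepAliveRequest() {}

    /**
     * Builds the raw text of an HTTP/1.1 GET request.
     * When lastRequest is true, "Connection: close" tells the server that the
     * client will close the socket after this response; otherwise the
     * connection stays open for the next request (the HTTP/1.1 default).
     */
    static String build(String host, String path, boolean lastRequest) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        sb.append("Host: ").append(host).append("\r\n"); // Host is mandatory in HTTP/1.1
        if (lastRequest) {
            sb.append("Connection: close\r\n");
        }
        sb.append("\r\n"); // empty line ends the header section
        return sb.toString();
    }
}
```

With this shape, every request except the final one to a host would be built with `lastRequest` set to `false`, so the server can keep its client thread attached to the same socket.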
I still need to run objective tests, which will probably take 2-3 days. So far the new plugin crawled and parsed 23,000 pages in 1,321 seconds (about 17 pages per second); the existing http-plugin seems to need a few days...
I am using a separate network segment with Windows XP (Nutch) and Suse Linux (Apache HTTPD + 120,000 pages).
Please find attached a new plugin based on http://www.innovation.ch/java/HTTPClient/
Please note:
Class HttpFactory contains a cache of HTTPConnection objects. Each object is shared by the fetcher threads and is thread-safe, so we can send multiple GET requests through a single instance. The number of cached objects per host is configurable:
private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
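The caching idea can be sketched roughly as follows. This is not the attached HttpFactory: the class `HostConnectionCache`, its round-robin policy, and the `connector` callback are assumptions made for illustration, and the type parameter `C` stands in for a thread-safe connection type such as HTTPConnection. The sketch also assumes Java 8 or later, which is newer than the J2SE 1.4.2 used in my tests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical per-host cache; C is assumed to be a thread-safe connection type. */
final class HostConnectionCache<C> {
    /** Connections opened so far for one host, plus a round-robin cursor. */
    private static final class Slot<C> {
        final List<C> connections = new ArrayList<>();
        int next = 0;
    }

    private final int clientsPerHost;            // e.g. the "http.clients.per.host" value
    private final Function<String, C> connector; // assumed callback that opens a connection
    private final Map<String, Slot<C>> slots = new ConcurrentHashMap<>();

    HostConnectionCache(int clientsPerHost, Function<String, C> connector) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.connector = connector;
    }

    /** Returns a cached connection for the host, opening at most clientsPerHost of them. */
    C get(String host) {
        Slot<C> slot = slots.computeIfAbsent(host, h -> new Slot<>());
        synchronized (slot) {
            if (slot.connections.size() < clientsPerHost) {
                C created = connector.apply(host);
                slot.connections.add(created);
                return created;
            }
            // Limit reached: hand out the existing connections in turn.
            C reused = slot.connections.get(slot.next);
            slot.next = (slot.next + 1) % clientsPerHost;
            return reused;
        }
    }
}
```

A fetcher thread would call `get(host)` before each request; once the per-host limit is reached, no further TCP connections are opened and the existing ones are reused.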
I'll add more comments after finishing the tests...