Carsten Lehmann made changes - 24/Dec/06 01:00 PM
Carsten Lehmann made changes - 24/Dec/06 01:09 PM
Some more explanations:
Above I meant http://gso.gbv.de/XYZ

I have attached two other log extracts:
a) squid_access_log_tail1000.txt: this file contains the last 1000 lines of the squid access log.
b) last_robots.txt_requests_squidlog.txt: this file shows the last requests to that particular robots.txt URL. It might be of concern that near the end of this file the line

I ran into this same problem, and spent some time debugging it. Here's what's going on:
Symptom: I was running a fetcher with 400 threads, and found it getting slower and slower as the fetch progressed. Stack dumps taken over time showed more and more threads blocked in the same place, in MultiThreadedHttpConnectionManager, waiting for a connection, called via http.getResponse() in getRobotRulesSet(). After hours of running, more than 390 of 400 threads were blocked!

A little sleuthing revealed what's going on. This happens to be the one place in the code where we call http.getResponse() directly, instead of via getProtocolOutput, which does its own connection throttling, so we're relying on MultiThreadedHttpConnectionManager's connection throttling.

The problem was that MultiThreadedHttpConnectionManager was ignoring the connection timeout, and waiting indefinitely for any other running threads to release a connection to the same site. If there are any large sites which time out on /robots.txt fetches, threads will "pile up" over time: they're not timing out, and the RobotRules cache doesn't cache timeouts, so every new thread hitting that site will try to fetch /robots.txt, and hang for an increasing amount of time as it has to wait for the (ever-increasing # of) previous threads to give up and relinquish the single connection.

We seem to be victims of HttpClient's somewhat byzantine parameter architecture. We do set the timeout, but we set it on the MultiThreadedHttpConnectionManager, which ignores it. If I set the parameter on the HttpClient instead, the problem goes away. I haven't looked at the HttpClient source, but I'm guessing that what happens is that internally, HttpClient is calling MultiThreadedHttpConnectionManager.getConnectionWithTimeout or some such, thereby overriding the connection manager's own timeouts.

At any rate, the fix is very straightforward, and a patch follows.
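To make the pile-up mechanism concrete, here is a minimal sketch in plain Java. It is not the patch and does not use HttpClient at all: `HostConnectionPool` is a hypothetical stand-in for a per-host limit of one connection, built on a `Semaphore`. It only shows the difference between waiting forever for a connection and giving up after a bounded wait. (In Commons HttpClient 3.x the client-level wait appears to be controlled by `HttpClientParams.setConnectionManagerTimeout`; treat that as an assumption to check against the library documentation.)

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy stand-in for a per-host connection limit; hypothetical, not HttpClient code. */
class HostConnectionPool {
    private final Semaphore permits;

    HostConnectionPool(int maxConnectionsPerHost) {
        // Fair semaphore: waiting threads are served in arrival order.
        this.permits = new Semaphore(maxConnectionsPerHost, true);
    }

    /** Waits with no upper bound: the behaviour that lets threads pile up behind a hung host. */
    void acquireWithoutTimeout() {
        permits.acquireUninterruptibly();
    }

    /** Waits at most waitMillis; returns false if no connection became free in time. */
    boolean acquire(long waitMillis) {
        try {
            return permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Returns a connection to the pool. */
    void release() {
        permits.release();
    }
}
```

With a limit of one connection, a second caller of `acquire(50)` gets `false` after roughly 50 ms while the first caller still holds the connection, whereas `acquireWithoutTimeout()` would block until the first caller releases it.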
I highly recommend this patch: it hugely sped up my crawl, and I suspect it will do so for others with similar configurations and at least one slow or hung large-ish site in the crawl.

There are two related issues: one, that the RobotRules cache doesn't keep track of failure cases, so sites that are down will be accessed many, many times; and two, that simultaneous /robots.txt accesses to the same site will all try to access the site. I've got a locally-modified version of getRobotRulesSet() that fixes these two problems as well. It's not perfect, but it makes the fetch both faster and more polite. I'll try to spiff that up and include it in a future patch, if anyone's interested.

Here's a context diff. Hopefully this will work; I'm rusty at creating patches, and did it outside of my normal development tree, since it's highly divergent from the Nutch trunk.
In any case, it's a one-liner, easy enough to add manually.

Doug, thank you for your excellent analysis. I'll apply the patch before 1.0; however, the poor handling of robots.txt needs to be addressed later (in 1.1).
Fixed in rev. 749247. Thank you!
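The two robots.txt follow-up issues raised in this thread (failed fetches are not cached, and simultaneous requests for the same site all hit the site) could be approached roughly as follows. This is only a sketch under stated assumptions, not the locally-modified getRobotRulesSet() mentioned above: `RobotsCache`, `FETCH_FAILED`, and the `fetcher` function are hypothetical names, and rules are represented as plain strings for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.function.Function;

/** Hypothetical per-host robots.txt cache; not Nutch's RobotRules API. */
class RobotsCache {
    /** Sentinel cached when the fetch failed, so a dead host is tried only once. */
    static final String FETCH_FAILED = "<fetch failed>";

    private final ConcurrentMap<String, FutureTask<String>> byHost = new ConcurrentHashMap<>();
    private final Function<String, String> fetcher; // assumed: host -> robots.txt body

    RobotsCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    String rulesFor(String host) {
        FutureTask<String> task = byHost.get(host);
        if (task == null) {
            FutureTask<String> fresh = new FutureTask<String>(() -> {
                try {
                    String body = fetcher.apply(host);
                    return body != null ? body : FETCH_FAILED;
                } catch (RuntimeException e) {
                    return FETCH_FAILED; // remember the failure instead of retrying
                }
            });
            task = byHost.putIfAbsent(host, fresh);
            if (task == null) {
                // This thread won the race, so it alone performs the fetch.
                task = fresh;
                fresh.run();
            }
        }
        try {
            return task.get(); // other threads wait here for the single fetch
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return FETCH_FAILED;
        } catch (ExecutionException e) {
            return FETCH_FAILED;
        }
    }
}
```

A real cache would presumably expire `FETCH_FAILED` entries after some interval so that a site which comes back up is eventually retried, and waiting threads would want a bounded `get` with a timeout; both are left out here to keep the sketch short.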
Andrzej Bialecki made changes - 02/Mar/09 09:12 AM
Integrated in Nutch-trunk #742 (See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/742/)
a) no entries in the log file between 22.51h and 23.02h; at 23.02h the fetch is aborted.
b) after the fetch is aborted, the stack traces show different URLs (not http://XYZ.gso.gbv.de), but this is what seems to be fetched, according to the last requests in the squid log (see other attachment)