Issue Details (XML | Word | Printable)

Key: NUTCH-419
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Andrzej Bialecki
Reporter: Carsten Lehmann
Votes: 1
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Nutch

unavailable robots.txt kills fetch

Created: 24/Dec/06 12:45 PM   Updated: 10/Apr/09 12:29 PM
Return to search
Component/s: fetcher
Affects Version/s: 0.8.1
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works diffs 2009-02-28 07:19 PM Doug Cook 0.8 kB
Text File Licensed for inclusion in ASF works last_robots.txt_requests_squidlog.txt 2006-12-24 01:09 PM Carsten Lehmann 19 kB
Text File Licensed for inclusion in ASF works nutch-log.txt 2006-12-24 01:00 PM Carsten Lehmann 4 kB
Text File Licensed for inclusion in ASF works squid_access_log_tail1000.txt 2006-12-24 01:09 PM Carsten Lehmann 141 kB
Environment:
Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant.
Nutch in local mode, running on a linux machine with 2GB RAM.

Resolution Date: 02/Mar/09 09:12 AM


 Description  « Hide
I think there is another robots.txt-related problem which is not
adressed by NUTCH-344,
but also results in an aborted fetch.

I am sure that in my last fetch all 17 fetcher threads died
while they were waiting for a robots.txt-file to be delivered by a not
properly responding web server.

I looked at the squid access log, which is used by all fetch threads.
It ends with many HTTP-504-errors ("gateway timeout") caused by a
certain robots.txt url:

<....>
1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET
http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html

These entries mean that it takes 15 minutes before the request ends
with a timeout.
This can be calculated from the squid log, the first column is the
request time (in UTC seconds), the second column is the duration of
the request (in ms):
900000/1000/60=15 minutes.

As far as I understand it, every time a fetch thread tries to get this
robots.txt-file the thread busy waits for the duration of the request
(15 minutes).
If this is right, then all 17 fetcher threads were caught in this trap
at the time when fetching was aborted, as there are 17 requests in
the squid log which did not timeout before the message "aborting with
17 threads" was written to the nutch-logfile.

Setting fetcher.max.crawl.delay can not help here.
I see 296 access attempts in total concerning this robots.txt-url in
the squid log of this crawl, but fetcher.max.crawl.delay is set to 30.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Carsten Lehmann added a comment - 24/Dec/06 01:00 PM
Log extract from hadoop.log. Interesting are two points:

a) no entries in the log file between 22.51h and 23.02h, at 23.02h the fetch is aborted.
b) after the fetch is aborted, the stacktraces show different urls (not http://XYZ.gso.gbv.de)
but this is what seems to be fetched, according to the last requests in the squid log (see other attachment)


Carsten Lehmann added a comment - 24/Dec/06 01:24 PM
Some more explanations:

Above I meant http://gso.gbv.de/XYZ, not http://XYZ.gso.gbv.de of course.

I have attached two other log extracts:

a) squid_access_log_tail1000.txt

this file contains the last 1000 lines of the squid access log.
it shows what the fetcher has actually been doing before the fetch gets aborted.
It ends with a number of requests to that certain robots.txt-url.

b) last_robots.txt_requests_squidlog.txt

this files shows the last requests to that certain robot.txt-url.

it might be of concern that near the end of this file the line
1166652145.652 1042451 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
repeats 14 times.
this means that there have been 14 simultaenous requests to this url,
right?
are requests to the robots.txt-file not included in "fetcher.server.delay", which is set to "2.0" in
my configuration?
anyway, this seems to be ill behaviour.


Doug Cook added a comment - 28/Feb/09 07:04 PM
I ran into this same problem, and spent some time debugging it. Here's what's going on:

Symptom:
Slow sites, or sites that hang, are causing Fetcher threads to hang for very long periods of time.

I was running a fetcher with 400 threads, and found it getting slower and slower as the fetch progressed. A look at the stack over time showed that over time, more and more threads blocked in the same place, in MultiThreadedHttpConnectionManager, waiting for a connection, called via http.getResponse() in getRobotRulesSet(). After hours of running, more than 390 of 400 threads were blocked!

A little sleuthing revealed what's going on. This happens to be the one place in the code where we call http.getResponse() directly, instead of via getProtocolOutput, which does its own connection throttling, so we're relying on MultiThreadedHttpConnectionManager's connection throttling. The problem was that MultiThreadedHttpConnectionManager was ignoring the connection timeout, and waiting indefinitely for any other running threads to release a connection to the same site. If there are any large sites which time out on /robots.txt fetches, threads will "pile up" over time, since they're not timing out, and the RobotRules cache doesn't cache timeouts, so every new thread hitting that site will try to fetch /robots.txt, and hang for an increasing amount of time as it has to wait for the (ever-increasing # of) previous threads to give up and relinquish the single connection.

We seem to be victims of HttpClient's somewhat byzantine parameter architecture. We do set the timeout, but we set it on the MultiThreadedHttpConnectionManager, which ignores it. If I set the parameter on the HttpClient instead, the problem goes away. I haven't looked at the HttpClient source, but I'm guessing that what happens is that internally, HttpClient is calling MultiThreadedHttpConnectionManager.getConnectionWithTimeout or some such, thereby overriding the connection manager's own timeouts. At any rate, the fix is very straightforward, and a patch follows.

I highly recommend this patch, it hugely sped up my crawl, and I suspect it will do so for others with similar configurations and at least one slow or hung large-ish site in the crawl.

There are two related issues: one, that the RobotRules cache doesn't keep track of failure cases, so sites that are down will be accessed many, many times; and two, that simultaneous /robots.txt accesses to the same site will all try to access the site. I've got a locally-modified version of getRobotRulesSet() that fixes these two problems as well. It's not perfect, but it makes the fetch both faster and more polite. I'll try to spiffy that up and include it in a future patch, if anyone's interested.


Doug Cook added a comment - 28/Feb/09 07:19 PM
Here's a context diff. Hopefully this will work, am rusty at creating patches, and did it outside of my normal development tree, since it's highly divergent from the Nutch trunk.

In any case, it's a one-liner, easy enough to add manually


Andrzej Bialecki added a comment - 01/Mar/09 07:31 AM
Doug, thank you for your excellent analysis. I'll apply the patch before 1.0 - however, the poor handling of robots.txt needs to be addressed later (in 1.1).

Andrzej Bialecki added a comment - 02/Mar/09 09:12 AM
Fixed in rev. 749247. Thank you!

Hudson added a comment - 04/Mar/09 04:18 AM