
Key: NUTCH-419
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Andrzej Bialecki
Reporter: Carsten Lehmann
Votes: 1
Watchers: 2

Nutch

unavailable robots.txt kills fetch

Created: 24/Dec/06 12:45 PM   Updated: 10/Apr/09 12:29 PM
Component/s: fetcher
Affects Version/s: 0.8.1
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
  Name                                   Date                 Author           Size
  diffs                                  2009-02-28 07:19 PM  Doug Cook        0.8 kB
  last_robots.txt_requests_squidlog.txt  2006-12-24 01:09 PM  Carsten Lehmann  19 kB
  nutch-log.txt                          2006-12-24 01:00 PM  Carsten Lehmann  4 kB
  squid_access_log_tail1000.txt          2006-12-24 01:09 PM  Carsten Lehmann  141 kB
Environment:
Fetcher is behind a squid proxy, but I am pretty sure this is irrelevant.
Nutch in local mode, running on a linux machine with 2GB RAM.

Resolution Date: 02/Mar/09 09:12 AM


Description
I think there is another robots.txt-related problem that is not
addressed by NUTCH-344
but that also results in an aborted fetch.

I am sure that in my last fetch all 17 fetcher threads died
while waiting for a robots.txt file from a web server that was not
responding properly.

I looked at the squid access log, which is used by all fetch threads.
It ends with many HTTP-504-errors ("gateway timeout") caused by a
certain robots.txt url:

<....>
1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html
1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html

These entries show that each request takes about 15 minutes before it
ends with a timeout.
This follows from the squid log: the first column is the request
timestamp (in seconds since the Unix epoch, UTC) and the second column
is the duration of the request (in ms):
900000 ms / 1000 / 60 = 15 minutes.
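
The arithmetic above can be checked directly against the quoted log
lines. The following is a minimal Python sketch, assuming squid's
native access.log layout (timestamp, elapsed ms, client, result
code/status, bytes, method, URL, ...). The helper name
parse_squid_line is made up for illustration and is not part of squid
or Nutch.

```python
def parse_squid_line(line):
    """Split one native-format squid access.log line into the fields used here.

    Assumes whitespace-separated fields in squid's native log layout.
    """
    fields = line.split()
    return {
        "logged_at": float(fields[0]),  # seconds since the Unix epoch
        "elapsed_ms": int(fields[1]),   # request duration in milliseconds
        "status": fields[3],            # e.g. TCP_MISS/504
        "url": fields[6],
    }


# The three entries quoted in this report, each on a single line.
SAMPLE = [
    "1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET "
    "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
    "1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET "
    "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
    "1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET "
    "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
]

entries = [parse_squid_line(line) for line in SAMPLE]
# ms -> s -> min, the same conversion as 900000/1000/60 above.
minutes = [entry["elapsed_ms"] / 1000 / 60 for entry in entries]

for entry, mins in zip(entries, minutes):
    print(f'{entry["url"]} {entry["status"]} {mins:.2f} min')
```

Each of the three durations comes out just under 15 minutes, which
matches the rounded figure used in the calculation.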

As far as I understand it, every time a fetch thread tries to get this
robots.txt file, the thread busy-waits for the duration of the request
(15 minutes).
If this is right, then all 17 fetcher threads were caught in this trap
when fetching was aborted: the squid log contains 17 requests that
had not yet timed out when the message "aborting with
17 threads" was written to the nutch log file.
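
The "requests still pending at abort time" reasoning can be sketched
as follows. This is only an illustration under two assumptions: that
squid writes the timestamp when a request finishes (so the start time
is the timestamp minus the duration), and that the abort time is
known. ABORT_AT below is a hypothetical value, not the real time from
nutch-log.txt, and in_flight_at is a made-up helper name.

```python
def in_flight_at(entries, moment):
    """Count requests that had started but not yet finished at `moment`.

    `entries` is a list of (logged_at_seconds, elapsed_ms) pairs.
    Assumes logged_at marks the end of the request.
    """
    count = 0
    for logged_at, elapsed_ms in entries:
        started_at = logged_at - elapsed_ms / 1000.0
        if started_at <= moment < logged_at:
            count += 1
    return count


# Timestamp and duration of the three entries quoted in this report.
ENTRIES = [
    (1166652253.332, 899427),
    (1166652343.350, 899664),
    (1166652353.560, 899871),
]

# Hypothetical abort time, chosen only for illustration.
ABORT_AT = 1166652000.0

print(in_flight_at(ENTRIES, ABORT_AT))
```

Run against the full squid log with the real abort time from the nutch
log, the same count should give the number of threads stuck on
robots.txt when the fetch was aborted.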

Setting fetcher.max.crawl.delay cannot help here.
In total I see 296 access attempts for this robots.txt URL in
the squid log of this crawl, even though fetcher.max.crawl.delay is
set to 30.
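
A count like the one above can be taken from the squid log with a few
lines of Python. This is a rough sketch, again assuming squid's native
access.log layout; count_attempts is a made-up helper name, and the
sample holds only the three entries quoted in this report, not the
full log that produced the figure of 296.

```python
from collections import Counter


def count_attempts(lines):
    """Count log entries per (URL, result code/status) pair."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        counts[(fields[6], fields[3])] += 1
    return counts


SAMPLE = [
    "1166652253.332 899427 127.0.0.1 TCP_MISS/504 1450 GET "
    "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
    "1166652343.350 899664 127.0.0.1 TCP_MISS/504 1450 GET "
    "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
    "1166652353.560 899871 127.0.0.1 TCP_MISS/504 1450 GET "
    "http://gso.gbv.de/robots.txt - DIRECT/193.174.240.8 text/html",
]

attempts = count_attempts(SAMPLE)
for (url, status), n in attempts.most_common():
    print(n, status, url)
```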


