
Key: NUTCH-105
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Critical
Assignee: Unassigned
Reporter: Rod Taylor
Votes: 1
Watchers: 1
Nutch

Network error during robots.txt fetch causes file to be ignored

Created: 07/Oct/05 12:42 AM   Updated: 24/Sep/06 03:30 PM
Component/s: fetcher
Affects Version/s: 0.8, 0.8.1, 0.9.0
Fix Version/s: 0.8.1, 0.9.0

Time Tracking:
Not Specified

File Attachments:
RobotRulesParser.patch (text file, 2 kB), attached 2006-08-19 12:36 AM by Greg Kim; licensed for inclusion in ASF works

Resolution Date: 19/Sep/06 04:08 PM


Description:
Earlier we had a small network glitch which prevented us from retrieving
the robots.txt file for a site we were crawling at the time:

nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193021
task_m_h02y5t Couldn't get robots.txt for
http://www.japanesetranslator.co.uk/portfolio/:
org.apache.commons.httpclient.ConnectTimeoutException: The host
did not accept the connection within timeout of 10000 ms
nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193031
task_m_h02y5t Couldn't get robots.txt for
http://www.japanesetranslator.co.uk/translation/:
org.apache.commons.httpclient.ConnectTimeoutException: The host
did not accept the connection within timeout of 10000 ms

Because the file could not be retrieved due to network issues, Nutch
assumed that it didn't exist and that we could crawl the entire
website. Nutch then successfully grabbed a few pages which were listed
in the robots.txt as being disallowed.

I think Nutch should continue attempting to retrieve the robots.txt file
until, at the very least, we are able to establish a connection to the
host; otherwise the host should be ignored until the next round of fetches.

The webmaster of japanesetranslator.co.uk filed a complaint informing us
of the issue.



Greg Kim added a comment - 19/Aug/06 12:36 AM
This patch stops Nutch from caching robots.txt rules on network errors or delays. Currently we cache EMPTY_RULES (which allows everything) for a host X when the robots.txt GET fails because of a network error or delay. That becomes a serious problem when the network recovers during the same crawl iteration: Nutch will crawl everything on host X, because EMPTY_RULES was cached by the first failed robots.txt GET (failed due to a network failure, not a 404).
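
For readers without the attachment at hand, the idea can be sketched roughly as follows. This is not the contents of RobotRulesParser.patch; the class RobotRulesCacheSketch, the RobotsFetcher interface and the method names are hypothetical, and the fetcher is assumed to return an empty body for a 404 and to throw IOException on a network failure. Rules are cached only when the host actually answered; on a network error nothing is cached, so the next URL on the same host triggers a fresh robots.txt attempt.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; names are hypothetical, not Nutch's actual API. */
public class RobotRulesCacheSketch {

    /** Hypothetical stand-in for the HTTP layer. Assumed to return the
     *  robots.txt body ("" for a 404) and to throw on network failure. */
    public interface RobotsFetcher {
        String fetch(String host) throws IOException;
    }

    private final Map<String, String> cache = new HashMap<>();
    private final RobotsFetcher fetcher;

    public RobotRulesCacheSketch(RobotsFetcher fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the rules for a host, or empty if they could not be
     *  fetched because of a network error. */
    public Optional<String> rulesFor(String host) {
        String cached = cache.get(host);
        if (cached != null) {
            return Optional.of(cached);
        }
        try {
            String body = fetcher.fetch(host);
            // The host answered (even if with a 404), so the result is
            // definitive and safe to cache.
            cache.put(host, body);
            return Optional.of(body);
        } catch (IOException e) {
            // Network failure: do NOT cache anything, so that a later
            // request for the same host retries robots.txt.
            return Optional.empty();
        }
    }

    /** True if rules for the host are currently cached. */
    public boolean isCached(String host) {
        return cache.containsKey(host);
    }
}
```

What the caller does with an empty result (skip the URL, or retry in the next round as the reporter suggests) is a separate policy choice that this sketch leaves open.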

Greg Kim made changes - 19/Aug/06 12:36 AM
Field Original Value New Value
Attachment RobotRulesParser.patch [ 12339151 ]
Greg Kim made changes - 19/Aug/06 12:37 AM
Affects Version/s 0.8.1 [ 12312020 ]
Affects Version/s 0.9.0 [ 12312013 ]
Greg Kim added a comment - 23/Aug/06 09:49 PM

Any hope of getting this patch committed? It's a simple fix for a potentially big problem. I've seen the problem multiple times and it evokes great anger among webmasters.

Greg Kim made changes - 23/Aug/06 09:49 PM
Component/s fetcher [ 11591 ]
Priority Major [ 3 ] Critical [ 2 ]
Sami Siren added a comment - 07/Sep/06 02:26 PM
Looks OK to me. If there are no objections, I'll commit this before 0.8.1.

Sami Siren made changes - 07/Sep/06 02:26 PM
Fix Version/s 0.8.1 [ 12312020 ]
Fix Version/s 0.9.0 [ 12312013 ]
Repository Revision Date User Message
ASF #447867 Tue Sep 19 14:52:37 UTC 2006 siren NUTCH-105 - Network error during robots.txt fetch causes file to be ignored, contributed by Greg Kim
Files Changed
MODIFY /lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
MODIFY /lucene/nutch/branches/branch-0.8/CHANGES.txt

Repository Revision Date User Message
ASF #447893 Tue Sep 19 16:01:34 UTC 2006 siren NUTCH-105 - Network error during robots.txt fetch causes file to be ignored, contributed by Greg Kim
Files Changed
MODIFY /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
MODIFY /lucene/nutch/trunk/CHANGES.txt

Sami Siren added a comment - 19/Sep/06 04:08 PM
This is now committed, thanks!

Sami Siren made changes - 19/Sep/06 04:08 PM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Sami Siren made changes - 24/Sep/06 03:30 PM
Status Resolved [ 5 ] Closed [ 6 ]