Droids / DROIDS-109

Several defects in robots exclusion protocol (robots.txt) implementation

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.2.0
    • Fix Version/s: None
    • Component/s: core, norobots
    • Labels: None

      Description

      1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath(), dropping the query part
      2. %2F represents the "/" (slash) character inside a path; it must not be decoded before a rule is applied
      3. NoRobotClient.isUrlAllowed(URI uri) double-decodes: in the method body, baseURI.getPath() already returns a decoded string, and we then call URLDecoder.decode(path, US_ASCII) on it again
      4. URLDecoder.decode(path, US_ASCII) also uses the wrong charset: UTF-8 must be used (see the sketch below)
      5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
      6. Wildcard characters should be recognized
      7. Sitemap directives are not supported
      8. Crawl rate (Crawl-delay) is not supported
      9. The BOM sequence is not removed before robots.txt is processed (http://unicode.org/faq/utf_bom.html; bytes 0xEF 0xBB 0xBF)

      and most probably there are many more defects (Nutch and Bixo haven't implemented this in full yet). I am working on it right now...
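
      A minimal sketch of the fixes for items 3, 4 and 9: strip the UTF-8 BOM before parsing, then percent-decode the raw path exactly once, with UTF-8. The class and method names below are illustrative, not existing Droids API (item 2, the %2F case, is sketched in a comment further down):

          import java.io.UnsupportedEncodingException;
          import java.net.URLDecoder;

          public final class RobotsTxtInput {

              // Removes the UTF-8 BOM (bytes 0xEF 0xBB 0xBF) if present,
              // per http://unicode.org/faq/utf_bom.html (item 9).
              static byte[] stripBom(byte[] content) {
                  if (content.length >= 3
                          && (content[0] & 0xFF) == 0xEF
                          && (content[1] & 0xFF) == 0xBB
                          && (content[2] & 0xFF) == 0xBF) {
                      byte[] stripped = new byte[content.length - 3];
                      System.arraycopy(content, 3, stripped, 0, stripped.length);
                      return stripped;
                  }
                  return content;
              }

              // Decodes a still-encoded path exactly once, with UTF-8. URI.getPath()
              // already returns a decoded string, so passing it to URLDecoder again
              // is the double-decoding defect (item 3); start from URI.getRawPath()
              // instead, and never decode with US_ASCII (item 4).
              static String decodeOnce(String rawPath) throws UnsupportedEncodingException {
                  return URLDecoder.decode(rawPath, "UTF-8");
              }
          }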

      Some references:
      http://nikitathespider.com/python/rerp/
      http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
      http://www.searchtools.com/robots/robots-txt.html
      http://en.wikipedia.org/wiki/Robots.txt

      The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
      The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
      We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.

      Update from Google:
      http://code.google.com/web/controlcrawlindex/

        Activity

        Fuad Efendi added a comment -

        I can work on it now, together with the Bixo team and crawler-commons.
        The problem is the InputStreams, but I'll try to minimize changes... I'll start with test cases... thanks

        Paul Rogalinski added a comment -

        I've successfully ported Bixo's implementation over to "my version" of Droids. Why no patch? Two issues: a) my copy of Droids is too far from the current trunk, and b) this patch would IMHO change too much (DroidsHttpClient/Protocol have been altered, for instance). Anybody with commit permissions up to the task?

        Otis Gospodnetic added a comment -

        Fuad opened an issue, but won't be providing a patch.
        Should we close this as Won't Fix until this starts itching somebody enough to submit a patch?

        Fuad Efendi added a comment -

        @Ken:
        The Bixo robots parser is great, especially the spellchecker and the many flavors of "new line" character it handles (something I actually encountered a few years ago and reported to Nutch).

        @Paul:
        Ken suggested the same, to design test cases; I am simply very limited in time... whenever I feel I need to share findings, I do share...

        It's much easier to improve Bixo or crawler-commons than to completely redesign Droids (in order to implement HTTP header pre-processing in Droids, I would need to stop using InputStream in the JavaBean classes and use byte arrays and metadata instead; it's easier to rewrite Droids from scratch than to submit a patch).

        Fuad Efendi added a comment - edited

        And another project hosted at Google, by Google, just documentation:
        http://code.google.com/web/controlcrawlindex/
        For instance, it documents the X-Robots-Tag HTTP header, Punycode (the Unicode encoding for domain names), etc.

        Ken Krugler added a comment -

        I'd separately emailed Fuad about crawler-commons, and also pointed him at the current robots.txt parsing code in Bixo. I'd taken all of the code/tests I could find from Nutch, Droids, Heritrix and one other Java-based crawler, and tried to come up with parsing code that passed all tests. Then I ran it against a 2.3M domain crawl, and tried to handle all of the common errors I encountered (typos, missing ':', etc).

        The big remaining issue is handling Google-esque URL patterns.
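
        A sketch of one way to handle such Google-style patterns (and items 5 and 6 of the description): translate each rule path into a regular expression, where "*" matches any run of characters and a trailing "$" anchors the end of the URL. The names below are illustrative, not Bixo or Droids API:

            import java.util.regex.Pattern;

            public final class RulePatterns {

                // Translates a robots.txt rule path into a regex: '*' matches any
                // character run; a trailing '$' anchors the match at the end.
                static Pattern compile(String rulePath) {
                    boolean anchored = rulePath.endsWith("$");
                    String body = anchored
                            ? rulePath.substring(0, rulePath.length() - 1) : rulePath;
                    StringBuilder regex = new StringBuilder();
                    for (char c : body.toCharArray()) {
                        regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
                    }
                    if (anchored) {
                        regex.append('$');
                    }
                    return Pattern.compile(regex.toString());
                }

                // True if the rule matches a prefix of path + query (the query part
                // must participate, per item 1). Among all matching Allow/Disallow
                // rules, the caller should apply the longest rule path (item 5).
                static boolean matches(String rulePath, String pathAndQuery) {
                    return compile(rulePath).matcher(pathAndQuery).lookingAt();
                }
            }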

        Thorsten Scherler added a comment -

        Actually, I am subscribed to the crawler-commons mailing list and was there when the project was created. There is not much traffic in that project; it was created to provide some independent ground between Nutch and Droids (basically those two at least, that was my impression) and some others.

        Otis Gospodnetic added a comment -

        Isn't this sort of stuff dealt with in the Crawler Commons project? See http://code.google.com/p/crawler-commons/

        Shouldn't Droids make use of the effort and functionality in that project? (N.B. I don't know what the state of that project is or what functionality it actually provides... I just had a quick look and don't see anything in the repo there about robots.txt handling, but I bet Ken Krugler could tell us about the plans, timelines, and such.)

        Paul Rogalinski added a comment -

        @Fuad:

        Can you design some tests for those issues? I understand that designing (J)Unit tests for this kind of problem is very time-consuming, so a bunch of folders, each representing one test scenario, with a description of the expected outcome, would be just fine to start with (see the sketch below).

        Currently I am working on a different part of Droids, but I will have to deal with robots.txt pretty soon, and I would be more than happy to commit a drop-in replacement for the current implementation addressing those issues.
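
        For example, a JUnit scaffold over exactly that layout could look like the following; the directory structure, file names, and the isUrlAllowed(...) hook are assumptions, not existing Droids code:

            import java.io.File;
            import java.nio.charset.Charset;
            import java.nio.file.Files;
            import org.junit.Test;
            import static org.junit.Assert.assertEquals;

            public class RobotsScenarioTest {

                // Assumed layout: each folder under src/test/resources/robots holds
                // a robots.txt plus an expected.txt with "<url> <allow|deny>" lines.
                @Test
                public void runAllScenarios() throws Exception {
                    File root = new File("src/test/resources/robots");
                    for (File dir : root.listFiles(File::isDirectory)) {
                        byte[] robotsTxt =
                                Files.readAllBytes(new File(dir, "robots.txt").toPath());
                        for (String line : Files.readAllLines(
                                new File(dir, "expected.txt").toPath(),
                                Charset.forName("UTF-8"))) {
                            String[] parts = line.trim().split("\\s+");
                            boolean expected = "allow".equals(parts[1]);
                            assertEquals(dir.getName() + ": " + parts[0],
                                    expected, isUrlAllowed(robotsTxt, parts[0]));
                        }
                    }
                }

                // Stand-in for the parser under test (hypothetical signature).
                private boolean isUrlAllowed(byte[] robotsTxt, String url) {
                    throw new UnsupportedOperationException("wire to the parser under test");
                }
            }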

        Fuad Efendi added a comment -

        1. I need to introduce an "Entity" with HTTP headers, expiration settings, last retrieval date, response code, exception message, etc., and I need to properly decode the byte array representing robots.txt (a rough sketch follows below).
        2. I need to modify some interfaces so that droids-norobots can use the (refactored) HttpContentEntity.
        And that gives us a cyclic dependency loop...

        It would be better to unite "core" and "norobots" into the same package... otherwise we need to move some interfaces from "core" into "norobots" (which doesn't seem nice).
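
        A rough sketch of such an entity, as a plain bean carrying raw bytes plus HTTP metadata instead of an InputStream; HttpContentEntity is the name proposed above, and the exact fields are assumptions drawn from point 1:

            import java.util.Map;

            public class HttpContentEntity {
                private final byte[] content;              // raw robots.txt bytes, decoded later
                private final Map<String, String> headers; // HTTP response headers (charset, expiry)
                private final int statusCode;              // e.g. 200, 404, 503
                private final long retrievedAt;            // last retrieval timestamp, millis
                private final String errorMessage;         // exception message, if retrieval failed

                public HttpContentEntity(byte[] content, Map<String, String> headers,
                                         int statusCode, long retrievedAt, String errorMessage) {
                    this.content = content;
                    this.headers = headers;
                    this.statusCode = statusCode;
                    this.retrievedAt = retrievedAt;
                    this.errorMessage = errorMessage;
                }

                public byte[] getContent() { return content; }
                public Map<String, String> getHeaders() { return headers; }
                public int getStatusCode() { return statusCode; }
                public long getRetrievedAt() { return retrievedAt; }
                public String getErrorMessage() { return errorMessage; }
            }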

        Fuad Efendi added a comment - edited

        http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):

        The matching process compares every octet in the path portion of
        the URL and the path from the record. If a %xx encoded octet is
        encountered it is unencoded prior to comparison, unless it is the
        "/" character, which has special meaning in a path. The match
        evaluates positively if and only if the end of the path from the
        record is reached before a difference in octets is encountered.

        Koster doesn't write anything about the encoding/decoding of robots.txt itself (HTTP response headers); he only mentions HTTP cache control, in section 3.4...

        Logically, we need to decode the path (excluding %2F) before comparing it to a rule, and the decoded path may contain any Unicode character (see the sketch below).

        It naturally follows that webmasters are allowed to use any charset in robots.txt, and that we must analyze the HTTP headers and decode the stream accordingly, although this isn't officially specified anywhere yet (except at http://nikitathespider.com/python/rerp/).

        Also, don't forget that "path" in this unofficial document (1996) really means everything after "protocol + // + host + port", for instance:
        /query;sessionID=123#My%2fAnchor?abc=123
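
        A sketch of that matching rule: decode every %xx octet in the path before comparison, except %2F, which keeps its special meaning as a path separator. The helper name is illustrative, and malformed escapes as well as multi-byte UTF-8 sequences would need extra handling:

            public final class PathOctets {

                // Decodes %xx escapes in a path, leaving %2F (and %2f) intact, per
                // the draft's octet-comparison rule. Byte-wise: each escaped octet
                // becomes one char, so multi-byte UTF-8 needs a second pass.
                static String decodeExceptSlash(String rawPath) {
                    StringBuilder out = new StringBuilder(rawPath.length());
                    for (int i = 0; i < rawPath.length(); i++) {
                        char c = rawPath.charAt(i);
                        if (c == '%' && i + 2 < rawPath.length()) {
                            int octet = Integer.parseInt(rawPath.substring(i + 1, i + 3), 16);
                            if (octet == 0x2F) {
                                out.append("%2F"); // keep the escaped slash escaped
                            } else {
                                out.append((char) octet);
                            }
                            i += 2;
                        } else {
                            out.append(c);
                        }
                    }
                    return out.toString();
                }
            }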

        Fuad Efendi added a comment -

        We also need to deal with the HTTP response headers: for instance, to decode robots.txt with the proper charset (see the sketch below), to deal with the expiration header, etc.
        I would have to modify the ContentLoader interface, then the implementations, and subsequently break the whole framework... let's think...
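
        A sketch of the charset part: pick the decode charset from the Content-Type response header and fall back to UTF-8 when it is absent or unrecognized (the header parsing below is deliberately simplified, and the fallback choice is an assumption):

            import java.nio.charset.Charset;

            public final class RobotsCharset {

                // Extracts "charset=..." from a Content-Type header value,
                // e.g. "text/plain; charset=ISO-8859-1"; defaults to UTF-8.
                static Charset fromContentType(String contentType) {
                    if (contentType != null) {
                        for (String param : contentType.split(";")) {
                            String p = param.trim().toLowerCase();
                            if (p.startsWith("charset=")) {
                                try {
                                    return Charset.forName(p.substring("charset=".length()).trim());
                                } catch (IllegalArgumentException ignored) {
                                    // unknown or illegal charset name: fall back to UTF-8
                                }
                            }
                        }
                    }
                    return Charset.forName("UTF-8");
                }
            }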


          People

          • Assignee: Unassigned
          • Reporter: Fuad Efendi
          • Votes: 0
          • Watchers: 0


              Time Tracking

              Estimated: 1,344h
              Remaining: 1,344h
              Logged: Not Specified
