Droids / DROIDS-109

Several defects in robots exclusion protocol (robots.txt) implementation

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.2.0
    • Fix Version/s: None
    • Component/s: core, norobots
    • Labels: None

      Description

      1. Googlebot and many other crawlers support rules that match the query part of a URL; Droids currently matches only against URI.getPath(), which excludes the query part.
      2. %2F represents an encoded "/" (slash) character inside a path; it should not be decoded before a rule is applied.
      3. NoRobotClient.isUrlAllowed(URI uri) decodes the path twice: baseURI.getPath() already returns a decoded string, and the method body then calls URLDecoder.decode(path, US_ASCII) on it.
      4. URLDecoder.decode(path, US_ASCII) uses the wrong charset; UTF-8 must be used.
      5. The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL
      6. Wildcard characters should be recognized
      7. Sitemaps are not handled
      8. Crawl rate is not handled
      9. The BOM sequence is not removed before robots.txt is processed (http://unicode.org/faq/utf_bom.html; bytes: 0xEF 0xBB 0xBF)
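
      Items 7 to 9 could be handled while the file is read. The following is only a minimal sketch, not the Droids implementation: the class RobotsTxtPreprocessor and its methods are hypothetical names, and the sketch deliberately ignores User-agent grouping (a crawl rate normally belongs to one group), so it only shows BOM removal and raw field extraction.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Droids code base. */
final class RobotsTxtPreprocessor {

    /** Decodes robots.txt bytes as UTF-8, skipping a leading BOM (0xEF 0xBB 0xBF) if present. */
    static String decode(byte[] raw) {
        int offset = 0;
        if (raw.length >= 3
                && (raw[0] & 0xFF) == 0xEF
                && (raw[1] & 0xFF) == 0xBB
                && (raw[2] & 0xFF) == 0xBF) {
            offset = 3; // skip the three BOM bytes
        }
        return new String(raw, offset, raw.length - offset, StandardCharsets.UTF_8);
    }

    /**
     * Collects the values of all "field: value" lines, ignoring case and
     * "#" comments. Assumption: User-agent groups are not taken into account.
     */
    static List<String> fieldValues(String robotsTxt, String field) {
        List<String> values = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n|\\r")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String name = line.substring(0, colon).trim();
            if (name.equalsIgnoreCase(field)) {
                values.add(line.substring(colon + 1).trim());
            }
        }
        return values;
    }
}
```

      With this sketch, fieldValues(text, "Sitemap") and fieldValues(text, "Crawl-delay") would return the raw values for a caller to interpret; the field name "Crawl-delay" is the commonly used spelling and is assumed here, since the description only says "crawl rate".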

      There are most probably many more defects (Nutch and BIXO have not implemented the protocol in full yet either). I am working on it right now...
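
      For items 1 to 4, one possible direction is to build the string that rules are matched against from URI.getRawPath() and URI.getRawQuery(), and to decode percent-escapes exactly once as UTF-8 while leaving %2F alone. This is a sketch under assumptions, not the fix itself: RobotsPathNormalizer and its method names are hypothetical, and also keeping %25 encoded is an extra choice made here so that the output stays unambiguous.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the Droids code base. */
final class RobotsPathNormalizer {

    /** Returns the raw (still encoded) path plus query, e.g. "/a%2Fb?x=1". */
    static String matchTarget(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // "http://example.com" is treated like "http://example.com/"
        }
        String query = uri.getRawQuery();
        return query == null ? path : path + "?" + query;
    }

    /**
     * Decodes %XX escapes once, as UTF-8. %2F (encoded slash) is kept encoded;
     * %25 (encoded percent sign) is kept too, which is an assumption of this sketch.
     */
    static String decodeOnceKeepingSlash(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()
                    && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2))) {
                int value = Integer.parseInt(s.substring(i + 1, i + 3), 16);
                if (value == 0x2F || value == 0x25) {
                    // keep the escape, normalised to upper-case hex digits
                    String escape = (value == 0x2F) ? "%2F" : "%25";
                    out.write(escape.getBytes(StandardCharsets.US_ASCII), 0, 3);
                } else {
                    out.write(value); // one raw byte of a UTF-8 sequence
                }
                i += 3;
            } else {
                // copy a literal character (handles surrogate pairs)
                int cp = s.codePointAt(i);
                byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(bytes, 0, bytes.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    private static boolean isHex(char ch) {
        return (ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F');
    }
}
```

      The same normalisation would have to be applied to the directive paths read from robots.txt, so that both sides of the comparison are in the same form.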

      Some references:
      http://nikitathespider.com/python/rerp/
      http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
      http://www.searchtools.com/robots/robots-txt.html
      http://en.wikipedia.org/wiki/Robots.txt

      The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at the very least.
      Proper reference: http://www.robotstxt.org/norobots-rfc.txt (1996).
      We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.
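
      As a starting point for documenting such rules, the precedence described in items 5 and 6 might look like the sketch below. Everything in it is an assumption rather than existing Droids behaviour: the class RobotsRuleMatcher and its Rule type are hypothetical, "*" is taken to match any character sequence, a trailing "$" is taken to anchor the end of the URL, and a tie between equally long Allow and Disallow paths is resolved in favour of Allow.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical matcher for illustration; not part of the Droids code base. */
final class RobotsRuleMatcher {

    static final class Rule {
        final boolean allow;
        final String path;   // directive path exactly as written in robots.txt
        final Pattern regex; // compiled form of the path

        Rule(boolean allow, String path) {
            this.allow = allow;
            this.path = path;
            this.regex = compile(path);
        }
    }

    /** Translates a directive path with "*" and an optional trailing "$" into a regex. */
    static Pattern compile(String path) {
        boolean anchored = path.endsWith("$");
        String body = anchored ? path.substring(0, path.length() - 1) : path;
        String[] literals = body.split("\\*", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                sb.append(".*"); // each "*" matches any sequence of characters
            }
            sb.append(Pattern.quote(literals[i]));
        }
        if (!anchored) {
            sb.append(".*"); // without "$" the directive is a prefix match
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /**
     * Applies the rule with the longest directive path (measured on the
     * path as written, not on what the wildcards expanded to).
     * No matching rule means the URL is allowed.
     */
    static boolean isAllowed(List<Rule> rules, String target) {
        Rule best = null;
        for (Rule r : rules) {
            // an empty path (e.g. "Disallow:") restricts nothing
            if (r.path.isEmpty() || !r.regex.matcher(target).matches()) {
                continue;
            }
            if (best == null
                    || r.path.length() > best.path.length()
                    || (r.path.length() == best.path.length() && r.allow)) {
                best = r;
            }
        }
        return best == null || best.allow;
    }
}
```

      Here target is expected to be the path plus query of the URL being checked, and the rules list is expected to hold only the directives of the group that applies to the crawler's user agent.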

      Update from Google:
      http://code.google.com/web/controlcrawlindex/

            People

            • Assignee: Unassigned
            • Reporter: Fuad Efendi (funtick)
            • Votes: 0
            • Watchers: 0

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated: 1,344h
                Remaining: 1,344h
                Logged: Not Specified