Droids / DROIDS-109

Several defects in robots exclusion protocol (robots.txt) implementation

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.2.0
    • Fix Version/s: None
    • Component/s: core, norobots
    • Labels: None

      Description

      1. Googlebot and many others support rules on the query part of the URL; Droids currently matches only URI.getPath(), dropping the query part
      2. %2F represents the "/" (slash) character inside a path; it must not be decoded before a rule is applied
      3. NoRobotClient.isUrlAllowed(URI uri) double-decodes: in the method body, baseURI.getPath() already returns a decoded string, and we then call URLDecoder.decode(path, US_ASCII) on it again
      4. URLDecoder.decode(path, US_ASCII) also uses the wrong charset: UTF-8 must be used (see the sketch below)
      5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL
      6. Wildcard characters should be recognized
      7. Sitemap directives are not supported
      8. Crawl rate (Crawl-delay) is not supported
      9. The BOM sequence is not removed before robots.txt is processed (http://unicode.org/faq/utf_bom.html; bytes 0xEF 0xBB 0xBF)

      and most probably there are many more defects (Nutch and Bixo haven't implemented this in full yet). I am working on it right now...
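
      A minimal sketch of the fixes for items 3, 4 and 9: strip the UTF-8 BOM before parsing, then percent-decode the raw path exactly once, with UTF-8. The class and method names below are illustrative, not existing Droids API (item 2, the %2F case, is sketched in a comment further down):

          import java.io.UnsupportedEncodingException;
          import java.net.URLDecoder;

          public final class RobotsTxtInput {

              // Removes the UTF-8 BOM (bytes 0xEF 0xBB 0xBF) if present,
              // per http://unicode.org/faq/utf_bom.html (item 9).
              static byte[] stripBom(byte[] content) {
                  if (content.length >= 3
                          && (content[0] & 0xFF) == 0xEF
                          && (content[1] & 0xFF) == 0xBB
                          && (content[2] & 0xFF) == 0xBF) {
                      byte[] stripped = new byte[content.length - 3];
                      System.arraycopy(content, 3, stripped, 0, stripped.length);
                      return stripped;
                  }
                  return content;
              }

              // Decodes a still-encoded path exactly once, with UTF-8. URI.getPath()
              // already returns a decoded string, so passing it to URLDecoder again
              // is the double-decoding defect (item 3); start from URI.getRawPath()
              // instead, and never decode with US_ASCII (item 4).
              static String decodeOnce(String rawPath) throws UnsupportedEncodingException {
                  return URLDecoder.decode(rawPath, "UTF-8");
              }
          }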

      Some references:
      http://nikitathespider.com/python/rerp/
      http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
      http://www.searchtools.com/robots/robots-txt.html
      http://en.wikipedia.org/wiki/Robots.txt

      The commonly referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
      The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
      We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.

      Update from Google:
      http://code.google.com/web/controlcrawlindex/

        Activity

        Fuad Efendi added a comment -

        I can work on it now, together with the Bixo team and crawler-commons.
        The problem is the InputStreams, but I'll try to minimize changes... I'll start with test cases... thanks

        Paul Rogalinski added a comment -

        I've successfully ported Bixo's implementation over to "my version" of Droids. Why no patch? Two issues: a) my copy of Droids is too far from the current trunk, and b) this patch would IMHO change too much (DroidsHttpClient/Protocol have been altered, for instance). Anybody with commit permissions up to the task?

        Otis Gospodnetic added a comment -

        Fuad opened an issue, but won't be providing a patch.
        Should we close this as Won't Fix until this starts itching somebody enough to submit a patch?

        Fuad Efendi added a comment -

        @Ken:
        The Bixo robots parser is great, especially the spellchecker and the many flavors of "new line" character it handles (something I actually encountered a few years ago and reported to Nutch).

        @Paul:
        Ken suggested the same, to design test cases; I am simply very limited in time... whenever I feel I need to share findings, I do share...

        It's much easier to improve Bixo or crawler-commons than to completely redesign Droids (in order to implement HTTP header pre-processing in Droids, I would need to stop using InputStream in the JavaBean classes and use byte arrays and metadata instead; it's easier to rewrite Droids from scratch than to submit a patch).

        Fuad Efendi added a comment - edited

        And another project hosted at Google, by Google, just documentation:
        http://code.google.com/web/controlcrawlindex/
        For instance, it documents the X-Robots-Tag HTTP header, Punycode (the Unicode encoding for domain names), etc.

        Ken Krugler added a comment -

        I'd separately emailed Fuad about crawler-commons, and also pointed him at the current robots.txt parsing code in Bixo. I'd taken all of the code/tests I could find from Nutch, Droids, Heritrix and one other Java-based crawler, and tried to come up with parsing code that passed all tests. Then I ran it against a 2.3M domain crawl, and tried to handle all of the common errors I encountered (typos, missing ':', etc).

        The big remaining issue is handling Google-esque URL patterns.
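
        A sketch of one way to handle such Google-style patterns (and items 5 and 6 of the description): translate each rule path into a regular expression, where "*" matches any run of characters and a trailing "$" anchors the end of the URL. The names below are illustrative, not Bixo or Droids API:

            import java.util.regex.Pattern;

            public final class RulePatterns {

                // Translates a robots.txt rule path into a regex: '*' matches any
                // character run; a trailing '$' anchors the match at the end.
                static Pattern compile(String rulePath) {
                    boolean anchored = rulePath.endsWith("$");
                    String body = anchored
                            ? rulePath.substring(0, rulePath.length() - 1) : rulePath;
                    StringBuilder regex = new StringBuilder();
                    for (char c : body.toCharArray()) {
                        regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
                    }
                    if (anchored) {
                        regex.append('$');
                    }
                    return Pattern.compile(regex.toString());
                }

                // True if the rule matches a prefix of path + query (the query part
                // must participate, per item 1). Among all matching Allow/Disallow
                // rules, the caller should apply the longest rule path (item 5).
                static boolean matches(String rulePath, String pathAndQuery) {
                    return compile(rulePath).matcher(pathAndQuery).lookingAt();
                }
            }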

        Thorsten Scherler added a comment -

        Actually, I am subscribed to the crawler-commons mailing list and was there when the project was created. There is not much traffic in that project; it was created to provide some independent ground between Nutch and Droids (basically those two at least, that was my impression) and some others.

        Otis Gospodnetic added a comment -

        Isn't this sort of stuff dealt with in the Crawler Commons project? See http://code.google.com/p/crawler-commons/

        Shouldn't Droids make use of the effort and functionality in that project? (N.B. I don't know what the state of that project is or what functionality it actually provides... I just had a quick look and don't see anything in the repo there about robots.txt handling, but I bet Ken Krugler could tell us about the plans, timelines, and such.)

        Paul Rogalinski added a comment -

        @Fuad:

        Can you design some tests for those issues? I understand that designing (J)Unit tests for this kind of problem is very time-consuming, so a bunch of folders, each representing one test scenario, with a description of the expected outcome, would be just fine to start with (see the sketch below).

        Currently I am working on a different part of Droids, but I will have to deal with robots.txt pretty soon, and I would be more than happy to commit a drop-in replacement for the current implementation addressing those issues.
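
        For example, a JUnit scaffold over exactly that layout could look like the following; the directory structure, file names, and the isUrlAllowed(...) hook are assumptions, not existing Droids code:

            import java.io.File;
            import java.nio.charset.Charset;
            import java.nio.file.Files;
            import org.junit.Test;
            import static org.junit.Assert.assertEquals;

            public class RobotsScenarioTest {

                // Assumed layout: each folder under src/test/resources/robots holds
                // a robots.txt plus an expected.txt with "<url> <allow|deny>" lines.
                @Test
                public void runAllScenarios() throws Exception {
                    File root = new File("src/test/resources/robots");
                    for (File dir : root.listFiles(File::isDirectory)) {
                        byte[] robotsTxt =
                                Files.readAllBytes(new File(dir, "robots.txt").toPath());
                        for (String line : Files.readAllLines(
                                new File(dir, "expected.txt").toPath(),
                                Charset.forName("UTF-8"))) {
                            String[] parts = line.trim().split("\\s+");
                            boolean expected = "allow".equals(parts[1]);
                            assertEquals(dir.getName() + ": " + parts[0],
                                    expected, isUrlAllowed(robotsTxt, parts[0]));
                        }
                    }
                }

                // Stand-in for the parser under test (hypothetical signature).
                private boolean isUrlAllowed(byte[] robotsTxt, String url) {
                    throw new UnsupportedOperationException("wire to the parser under test");
                }
            }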

        Fuad Efendi added a comment -

        1. I need to introduce an "Entity" with HTTP headers, expiration settings, last retrieval date, response code, exception message, etc., and I need to properly decode the byte array representing robots.txt (a rough sketch follows below).
        2. I need to modify some interfaces so that droids-norobots can use the (refactored) HttpContentEntity.
        And that gives us a cyclic dependency loop...

        It would be better to unite "core" and "norobots" into the same package... otherwise we need to move some interfaces from "core" into "norobots" (which doesn't seem nice).
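
        A rough sketch of such an entity, as a plain bean carrying raw bytes plus HTTP metadata instead of an InputStream; HttpContentEntity is the name proposed above, and the exact fields are assumptions drawn from point 1:

            import java.util.Map;

            public class HttpContentEntity {
                private final byte[] content;              // raw robots.txt bytes, decoded later
                private final Map<String, String> headers; // HTTP response headers (charset, expiry)
                private final int statusCode;              // e.g. 200, 404, 503
                private final long retrievedAt;            // last retrieval timestamp, millis
                private final String errorMessage;         // exception message, if retrieval failed

                public HttpContentEntity(byte[] content, Map<String, String> headers,
                                         int statusCode, long retrievedAt, String errorMessage) {
                    this.content = content;
                    this.headers = headers;
                    this.statusCode = statusCode;
                    this.retrievedAt = retrievedAt;
                    this.errorMessage = errorMessage;
                }

                public byte[] getContent() { return content; }
                public Map<String, String> getHeaders() { return headers; }
                public int getStatusCode() { return statusCode; }
                public long getRetrievedAt() { return retrievedAt; }
                public String getErrorMessage() { return errorMessage; }
            }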

        Fuad Efendi added a comment - edited

        http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):

        The matching process compares every octet in the path portion of
        the URL and the path from the record. If a %xx encoded octet is
        encountered it is unencoded prior to comparison, unless it is the
        "/" character, which has special meaning in a path. The match
        evaluates positively if and only if the end of the path from the
        record is reached before a difference in octets is encountered.

        Koster doesn't write anything about the encoding/decoding of robots.txt itself (HTTP response headers); he only mentions HTTP cache control, in section 3.4...

        Logically, we need to decode the path (excluding %2F) before comparing it to a rule, and the decoded path may contain any Unicode character (see the sketch below).

        It naturally follows that webmasters are allowed to use any charset in robots.txt, and that we must analyze the HTTP headers and decode the stream accordingly, although this isn't officially specified anywhere yet (except at http://nikitathespider.com/python/rerp/).

        Also, don't forget that "path" in this unofficial document (1996) really means everything after "protocol + // + host + port", for instance:
        /query;sessionID=123#My%2fAnchor?abc=123
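
        A sketch of that matching rule: decode every %xx octet in the path before comparison, except %2F, which keeps its special meaning as a path separator. The helper name is illustrative, and malformed escapes as well as multi-byte UTF-8 sequences would need extra handling:

            public final class PathOctets {

                // Decodes %xx escapes in a path, leaving %2F (and %2f) intact, per
                // the draft's octet-comparison rule. Byte-wise: each escaped octet
                // becomes one char, so multi-byte UTF-8 needs a second pass.
                static String decodeExceptSlash(String rawPath) {
                    StringBuilder out = new StringBuilder(rawPath.length());
                    for (int i = 0; i < rawPath.length(); i++) {
                        char c = rawPath.charAt(i);
                        if (c == '%' && i + 2 < rawPath.length()) {
                            int octet = Integer.parseInt(rawPath.substring(i + 1, i + 3), 16);
                            if (octet == 0x2F) {
                                out.append("%2F"); // keep the escaped slash escaped
                            } else {
                                out.append((char) octet);
                            }
                            i += 2;
                        } else {
                            out.append(c);
                        }
                    }
                    return out.toString();
                }
            }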

        Fuad Efendi added a comment -

        We also need to deal with the HTTP response headers: for instance, to decode robots.txt with the proper charset (see the sketch below), to deal with the expiration header, etc.
        I would have to modify the ContentLoader interface, then the implementations, and subsequently break the whole framework... let's think...
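
        A sketch of the charset part: pick the decode charset from the Content-Type response header and fall back to UTF-8 when it is absent or unrecognized (the header parsing below is deliberately simplified, and the fallback choice is an assumption):

            import java.nio.charset.Charset;

            public final class RobotsCharset {

                // Extracts "charset=..." from a Content-Type header value,
                // e.g. "text/plain; charset=ISO-8859-1"; defaults to UTF-8.
                static Charset fromContentType(String contentType) {
                    if (contentType != null) {
                        for (String param : contentType.split(";")) {
                            String p = param.trim().toLowerCase();
                            if (p.startsWith("charset=")) {
                                try {
                                    return Charset.forName(p.substring("charset=".length()).trim());
                                } catch (IllegalArgumentException ignored) {
                                    // unknown or illegal charset name: fall back to UTF-8
                                }
                            }
                        }
                    }
                    return Charset.forName("UTF-8");
                }
            }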


          People

          • Assignee: Unassigned
          • Reporter: Fuad Efendi
          • Votes: 0
          • Watchers: 0


              Time Tracking

              Estimated: 1,344h
              Remaining: 1,344h
              Logged: Not Specified
