CouchDB / COUCHDB-883

Wrong document returned due to incorrect URL decoding

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: HTTP Interface
    • Labels:
      None
    • Environment:

      Kubuntu 10.4, Firefox 3.6.8

    • Skill Level:
      Committers Level (Medium to Hard)

      Description

      I have two documents in my database: "a b" and "a+b". The first can be retrieved via "/mydb/a%20b" and the second via "/mydb/a%2Bb".
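For illustration, the two encoded URLs above can be reproduced with a small Python sketch. It uses only the standard library; the helper name `doc_url` is hypothetical and is not part of CouchDB or any client library.

```python
from urllib.parse import quote


def doc_url(db: str, doc_id: str) -> str:
    """Build a request path for a document id (hypothetical helper).

    safe="" tells quote() to percent-encode every reserved character,
    including "/" and "+", so the id always travels as one path segment.
    """
    return f"/{quote(db, safe='')}/{quote(doc_id, safe='')}"


print(doc_url("mydb", "a b"))  # /mydb/a%20b
print(doc_url("mydb", "a+b"))  # /mydb/a%2Bb
```

Encoding the id this way sidesteps the ambiguity on the client side, whatever the server does with a literal "+".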

      When I enter "/mydb/a b" in the browser, it automatically encodes the URL, so the correct document is returned. But when I enter "/mydb/a+b", the URL is sent intact, since "+" is a valid character in a path segment according to [1]. The problem is that "GET /mydb/a+b" makes CouchDB return the document with id "a b" and not the intended one, which is against the URI spec.

      For an informal description of URL encoding one may refer to [2].

      [1]: http://www.ietf.org/rfc/rfc2396.txt
      [2]: http://www.lunatech-research.com/archives/2009/02/03/what-every-web-developer-must-know-about-url-encoding

      1. logging.diff
        3 kB
        Muharem Hrnjadovic

        Activity

        Paul Joseph Davis made changes -
        Skill Level Committers Level (Medium to Hard)
        Muharem Hrnjadovic made changes -
        Attachment logging.diff [ 12454408 ]
        Muharem Hrnjadovic added a comment -

        I added some logging statements to find out where the a+b -> a b conversion takes place and came to realize that it happens in handle_request() (src/couchdb/couch_httpd.erl, line 237) after the 'requested_path_parts' and 'path_parts' are mangled through couch_httpd:unquote() which in turn calls mochiweb_util:unquote().

        A quick experiment confirms that:

        $ erl -pz $HOME/src/couchdb/src/mochiweb
        Erlang R14A (erts-5.8) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

        Eshell V5.8 (abort with ^G)
        1> mochiweb_util:unquote("a+b")
        1> .
        "a b"
        2>
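As a rough analogy only (this is Python's standard library, not mochiweb), the same difference can be seen between two decoders: one that applies the form-encoding rule and turns "+" into a space, as the transcript above shows mochiweb_util:unquote() doing, and one that decodes only %XX escapes.

```python
from urllib.parse import unquote, unquote_plus

path_segment = "a+b"

# Form-style decoding: "+" is turned into a space.
assert unquote_plus(path_segment) == "a b"

# Plain percent-decoding: "+" is left alone, only %XX escapes are decoded.
assert unquote(path_segment) == "a+b"
assert unquote("a%20b") == "a b"
assert unquote("a%2Bb") == "a+b"
```

With the second decoder, "a+b", "a%2Bb" and "a%20b" each map to the document id the reporter expects.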

        Taras Puchko added a comment -

        Sebastian, "reserved" does NOT mean that a character must be encoded in all parts of a URL.

        2.2. Reserved Characters
        Characters in the "reserved" set are not reserved in all contexts. The set of characters actually reserved within any given URI component is defined by that component. In general, a character is reserved if the semantics of the URI changes if the character is replaced with its escaped US-ASCII encoding.

        3.3. Path Component
        segment = *pchar *( ";" param )
        pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
        The path may consist of a sequence of path segments separated by a single slash "/" character.
        Within a path segment, the characters "/", ";", "=", and "?" are reserved.
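The grammar quoted above can be turned into a small checker. This is a sketch under assumptions: the character sets are transcribed from RFC 2396 (sections 2.3 and 3.3), the function name `is_valid_segment_body` is made up for illustration, and the optional `";" param` part of a segment is deliberately not handled.

```python
import string

# RFC 2396 section 2.3: unreserved = alphanum | mark
MARK = "-_.!~*'()"
UNRESERVED = set(string.ascii_letters + string.digits + MARK)
# RFC 2396 section 3.3: characters pchar allows literally besides unreserved
PCHAR_EXTRA = set(":@&=+$,")
HEX = set(string.hexdigits)


def is_valid_segment_body(segment: str) -> bool:
    """Return True if segment matches *pchar (params after ';' are not handled)."""
    i = 0
    while i < len(segment):
        ch = segment[i]
        if ch == "%":
            # escaped = "%" hex hex
            pair = segment[i + 1:i + 3]
            if len(pair) != 2 or not all(c in HEX for c in pair):
                return False
            i += 3
        elif ch in UNRESERVED or ch in PCHAR_EXTRA:
            i += 1
        else:
            return False
    return True


print(is_valid_segment_body("a+b"))    # True: "+" is a pchar
print(is_valid_segment_body("a b"))    # False: a raw space must be escaped
print(is_valid_segment_body("a%20b"))  # True
```

Under this grammar a literal "+" is legal in a path segment, while a raw space is not.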

        Sebastian Cohnen added a comment -

        RFC2396: G.2. Modifications from both RFC 1738 and RFC 1808

        "The plus "+", dollar "$", and comma "," characters have been added to those in the "reserved" set, since they are treated as reserved within the query component."

        Therefore you need to URI-encode the plus character according to the RFC.

        Muharem Hrnjadovic added a comment -

        FWIW, a URL like http://localhost/a+b is left alone by apache2, i.e. I see the following entry in /var/log/apache2/access.log:

        127.0.0.1 - - [10/Sep/2010:05:13:30 +0200] "GET /a+b HTTP/1.1" 200 294 "-"

        Also, a file with that name (a+b) is served correctly.

        Taras Puchko made changes -
        Resolution Not A Problem [ 8 ]
        Status Closed [ 6 ] Reopened [ 4 ]
        Taras Puchko added a comment -

        Robert, you are wrong. What "uri escaping rules" are you talking about?

        I've specifically pointed to the spec. Read "3.3. Path Component" and "2.2. Reserved Characters".

        There is no rule that makes a plus sign be interpreted as a space. It's a compatibility behavior applicable ONLY to query parameter values.

        Please read http://www.lunatech-research.com/archives/2009/02/03/what-every-web-developer-must-know-about-url-encoding
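A minimal sketch of the component-aware decoding argued for in this comment, written in Python for illustration. It is not how CouchDB or mochiweb handles requests; the function name `decode_request_target` and the example query parameter `key` are assumptions.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def decode_request_target(target: str):
    """Decode path and query with the rule that applies to each component."""
    parts = urlsplit(target)
    # Split on "/" before decoding so an escaped slash (%2F) stays inside
    # its segment; unquote() decodes %XX only and leaves "+" untouched.
    segments = [unquote(s) for s in parts.path.split("/") if s]
    # parse_qs() applies form decoding, where "+" does mean a space.
    query = parse_qs(parts.query)
    return segments, query


segments, query = decode_request_target("/mydb/a+b?key=a+b")
print(segments)  # ['mydb', 'a+b']
print(query)     # {'key': ['a b']}
```

The same "a+b" text decodes differently depending on whether it appears in the path or in a query parameter value.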

        Robert Newson made changes -
        Field Original Value New Value
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Not A Problem [ 8 ]
        Robert Newson added a comment -

        This is expected behavior. The + is interpreted as a space according to the uri escaping rules. Use %2b if you want to keep the + symbol.

        Taras Puchko created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Taras Puchko