Solr
  1. Solr
  2. SOLR-4283

Improve URL decoding (followup of SOLR-4265)

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      Followup of SOLR-4265:
      SOLR-4265 has 2 problems:

      • it reads the whole InputStream into a String and this one can be big. This wastes memory, especially when your query string from the POSted form data is near the 2 Megabyte limit. The String is then packed in splitted form into a big Map.
      • it does not report corrupt UTF-8

      The attached patch will do 2 things:

      • The decoding of the POSTed form data is done on the ServletInputStream, directly parsing the bytes (not chars). Key/Value pairs are extracted and %-decoded to byte[] on the fly. URL-parameters from getQueryString() are parsed with the same code using ByteArrayInputStream on the original String, interpreted as UTF-8 (this is a hack, because Servlet API does not give back the original bytes from the HTTP request). To be standards conform, the query String should be interpreted as US-ASCII, but with this approach, not full escaped UTF-8 from the HTTP request survive.
      • the byte[] key/value pairs are converted to Strings using CharsetDecoder

      This will be memory efficient and will report incorrect escaped form data, so people will no longer complain if searches hit no results or similar.

      1. index.jsp
        0.3 kB
        Dawid Weiss
      2. request.http
        0.4 kB
        Dawid Weiss
      3. SOLR-4283.patch
        17 kB
        Uwe Schindler
      4. SOLR-4283.patch
        16 kB
        Uwe Schindler
      5. SOLR-4283.patch
        15 kB
        Uwe Schindler
      6. SOLR-4283.patch
        13 kB
        Uwe Schindler
      7. SOLR-4283.patch
        11 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          More tests, disallow missing key

          Show
          Uwe Schindler added a comment - More tests, disallow missing key
          Hide
          Dawid Weiss added a comment -
          +        final InputStream in = new ByteArrayInputStream(queryString.getBytes(IOUtils.CHARSET_UTF_8));
          

          I think if you pass raw unescaped UTF-8 in a HTTP message the query String (decoded object) will be actually a sequence of char[] with bytes corresponding to the original bytes in the HTTP header, not properly decoded UTF-8 so you'd be double-decoding UTF-8 here. I assume the container parses uris using byte-identity codepage (US-ASCII). It's probably worth checking with a netcat-prepared HTTP message to see what they actually do.

          I think it'd be more sensible to decode char[] into byte[] with masking 0xff (and possibly throw an exception if something is non-zero after ~0xff.

          Show
          Dawid Weiss added a comment - + final InputStream in = new ByteArrayInputStream(queryString.getBytes(IOUtils.CHARSET_UTF_8)); I think if you pass raw unescaped UTF-8 in a HTTP message the query String (decoded object) will be actually a sequence of char[] with bytes corresponding to the original bytes in the HTTP header, not properly decoded UTF-8 so you'd be double-decoding UTF-8 here. I assume the container parses uris using byte-identity codepage (US-ASCII). It's probably worth checking with a netcat-prepared HTTP message to see what they actually do. I think it'd be more sensible to decode char[] into byte[] with masking 0xff (and possibly throw an exception if something is non-zero after ~0xff.
          Hide
          Uwe Schindler added a comment -

          Dawid: I mentioned that:

          To be standards conform, the query String should be interpreted as US-ASCII, but with this approach, not full escaped UTF-8 from the HTTP request survive.

          Jetty uses uses UTF-8 to parse HTTP requests. This approach is to make at least some request with unescaped UTF-8 bytes survive. If I mask away the lower bits i just loose information. For the URL-decoding it does not matter, so I leave it there. This was already discussed with Yonik in SOLR-4265.

          Uwe

          Show
          Uwe Schindler added a comment - Dawid: I mentioned that: To be standards conform, the query String should be interpreted as US-ASCII, but with this approach, not full escaped UTF-8 from the HTTP request survive. Jetty uses uses UTF-8 to parse HTTP requests. This approach is to make at least some request with unescaped UTF-8 bytes survive. If I mask away the lower bits i just loose information. For the URL-decoding it does not matter, so I leave it there. This was already discussed with Yonik in SOLR-4265 . Uwe
          Hide
          Dawid Weiss added a comment -

          Jetty uses uses UTF-8 to parse HTTP requests.

          I don't think it should (or does); if it does though, it'd be weird because it should treat HTTP headers with US-ASCII and not interpret anything.

          I did the following experiment. I prepared a raw HTTP request with an "ł" character inside (Unicode U+0141, UTF-8 sequence: C5 82). Then I prepared the following simple JSP (servlet):

          <%
            response.setContentType("text/plain; charset=UTF-8");
            String qs = request.getQueryString();
            if (qs == null) qs = "";
            for (char chr : qs.toCharArray()) {
              out.print(Integer.toHexString(chr));
              out.print(" ");
            }
            out.println();
          %>
          

          As you can see what it does is it hexdumps all the char values from the query string, as provided by the container. It is unfortunate that the container HAS to provide a String here because it needs to do some sort of input bytes->chars conversion. My opinion is that this should be an identity mapping of some sort; copying bytes to chars (with 0xff mask) will give you a rough equivalent of doing new String(inputQueryStringBytes, "ISO8859-1"). This string cannot be converted back to bytes with qs.getBytes("UTF-8") because these bytes are no longer the same characters (we don't know what they were, actually). Even if we assume they were proper UTF-8 they will no longer be converted as such.

          This is Tomcat, for example:

          $ nc6 localhost 8080 < request.http
          nc6: using stream socket
          HTTP/1.1 200 OK
          Server: Apache-Coyote/1.1
          Set-Cookie: JSESSIONID=EFBEE4830C7B91D979938E82C9CAB6CB; Path=/test/; HttpOnly
          Content-Type: text/plain;charset=UTF-8
          Content-Length: 30
          Date: Tue, 08 Jan 2013 08:35:41 GMT
          
          61 61 c5 82 62 62
          

          This follows my expectation of a HTTP server – it didn't corrupt anything, it didn't interpret anything because it couldn't.

          Now jetty9:

          $ nc6 localhost 8080 < request.http
          nc6: using stream socket
          HTTP/1.1 200 OK
          Date: Tue, 08 Jan 2013 08:42:49 GMT
          Set-Cookie: JSESSIONID=1d4i8p9coo32y11wo09leai3iw;Path=/
          Expires: Thu, 01 Jan 1970 00:00:00 GMT
          Content-Type: text/plain; charset=UTF-8
          Content-Length: 24
          Server: Jetty(9.0.0.M3)
          
          61 61 3f 62 62
          

          This is wrong; 3f stands for '?' so clearly Jetty tries to interpret the input bytes from HTTP somehow and fails on doing so. What it does exactly – I've no idea, would have to check the code.

          With Jetty8, interestingly:

          $ nc6 localhost 8080 < request.http
          nc6: using stream socket
          HTTP/1.1 200 OK
          Date: Tue, 08 Jan 2013 08:57:56 GMT
          Set-Cookie: JSESSIONID=12ivf935hblkk1452cmwpgntt9;Path=/
          Expires: Thu, 01 Jan 1970 00:00:00 GMT
          Content-Type: text/plain;charset=UTF-8
          Content-Length: 26
          Server: Jetty(8.1.8.v20121106)
          
          61 61 142 62 62
          aałbb
          

          Ha! So here's where Yonik claim that Jetty parses these UTF-8 URIs right came from...

          I honestly think this is a BUG in Jetty8 though (with a regression to an even worse bug in Jetty9...). So I sustain my claim that the conversion:

          queryString.getBytes("UTF-8")
          

          to recover the input byte stream is incorrect. A fix? There doesn't seem to be one that would work for all the containers (for tomcat it'd be going back char-to-byte one by one and then parsing as UTF-8 for example, for Jetty8 it's already in UTF-8, for Jetty9 it's screwed up from the beginning).

          Show
          Dawid Weiss added a comment - Jetty uses uses UTF-8 to parse HTTP requests. I don't think it should (or does); if it does though, it'd be weird because it should treat HTTP headers with US-ASCII and not interpret anything. I did the following experiment. I prepared a raw HTTP request with an "ł" character inside (Unicode U+0141, UTF-8 sequence: C5 82). Then I prepared the following simple JSP (servlet): <% response.setContentType( "text/plain; charset=UTF-8" ); String qs = request.getQueryString(); if (qs == null ) qs = ""; for ( char chr : qs.toCharArray()) { out.print( Integer .toHexString(chr)); out.print( " " ); } out.println(); %> As you can see what it does is it hexdumps all the char values from the query string, as provided by the container. It is unfortunate that the container HAS to provide a String here because it needs to do some sort of input bytes->chars conversion. My opinion is that this should be an identity mapping of some sort; copying bytes to chars (with 0xff mask) will give you a rough equivalent of doing new String(inputQueryStringBytes, "ISO8859-1"). This string cannot be converted back to bytes with qs.getBytes("UTF-8") because these bytes are no longer the same characters (we don't know what they were, actually). Even if we assume they were proper UTF-8 they will no longer be converted as such. This is Tomcat, for example: $ nc6 localhost 8080 < request.http nc6: using stream socket HTTP/1.1 200 OK Server: Apache-Coyote/1.1 Set-Cookie: JSESSIONID=EFBEE4830C7B91D979938E82C9CAB6CB; Path=/test/; HttpOnly Content-Type: text/plain;charset=UTF-8 Content-Length: 30 Date: Tue, 08 Jan 2013 08:35:41 GMT 61 61 c5 82 62 62 This follows my expectation of a HTTP server – it didn't corrupt anything, it didn't interpret anything because it couldn't. Now jetty9: $ nc6 localhost 8080 < request.http nc6: using stream socket HTTP/1.1 200 OK Date: Tue, 08 Jan 2013 08:42:49 GMT Set-Cookie: JSESSIONID=1d4i8p9coo32y11wo09leai3iw;Path=/ Expires: Thu, 01 Jan 1970 00:00:00 GMT Content-Type: text/plain; charset=UTF-8 Content-Length: 24 Server: Jetty(9.0.0.M3) 61 61 3f 62 62 This is wrong; 3f stands for '?' so clearly Jetty tries to interpret the input bytes from HTTP somehow and fails on doing so. What it does exactly – I've no idea, would have to check the code. With Jetty8, interestingly: $ nc6 localhost 8080 < request.http nc6: using stream socket HTTP/1.1 200 OK Date: Tue, 08 Jan 2013 08:57:56 GMT Set-Cookie: JSESSIONID=12ivf935hblkk1452cmwpgntt9;Path=/ Expires: Thu, 01 Jan 1970 00:00:00 GMT Content-Type: text/plain;charset=UTF-8 Content-Length: 26 Server: Jetty(8.1.8.v20121106) 61 61 142 62 62 aałbb Ha! So here's where Yonik claim that Jetty parses these UTF-8 URIs right came from... I honestly think this is a BUG in Jetty8 though (with a regression to an even worse bug in Jetty9...). So I sustain my claim that the conversion: queryString.getBytes( "UTF-8" ) to recover the input byte stream is incorrect. A fix? There doesn't seem to be one that would work for all the containers (for tomcat it'd be going back char-to-byte one by one and then parsing as UTF-8 for example, for Jetty8 it's already in UTF-8, for Jetty9 it's screwed up from the beginning).
          Hide
          Dawid Weiss added a comment -

          Test files.

          Show
          Dawid Weiss added a comment - Test files.
          Hide
          Dawid Weiss added a comment -

          https://github.com/eclipse/jetty.project/commit/31def062146702f03da8d6f2e7010b7d4586439b

          this is a fairly recent commit (15 days ago). You can see they're still assuming UTF-8 for URIs. I used jetty9M3 in my tests above, don't know if it works with jetty9M5 (probably, given the above commit).

          Show
          Dawid Weiss added a comment - https://github.com/eclipse/jetty.project/commit/31def062146702f03da8d6f2e7010b7d4586439b this is a fairly recent commit (15 days ago). You can see they're still assuming UTF-8 for URIs. I used jetty9M3 in my tests above, don't know if it works with jetty9M5 (probably, given the above commit).
          Hide
          Uwe Schindler added a comment -

          Attached is a patch that fails cleanly on the query string if it contains bytes > 127.

          This is the most portable approach and helps users because it explicitely says: encode your query string

          Show
          Uwe Schindler added a comment - Attached is a patch that fails cleanly on the query string if it contains bytes > 127. This is the most portable approach and helps users because it explicitely says: encode your query string
          Hide
          Uwe Schindler added a comment -

          Fix test bug found by this! Yipee!

          Show
          Uwe Schindler added a comment - Fix test bug found by this! Yipee!
          Hide
          Uwe Schindler added a comment -

          More readable and helpful Exception message on BAD_REQUEST HTTP response.

          I think it is ready to commit. We should try this out and find out how this affects users with incorrect applications.

          Show
          Uwe Schindler added a comment - More readable and helpful Exception message on BAD_REQUEST HTTP response. I think it is ready to commit. We should try this out and find out how this affects users with incorrect applications.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1430396

          SOLR-4283: Improvements for URL-Decoding

          Show
          Commit Tag Bot added a comment - [trunk commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1430396 SOLR-4283 : Improvements for URL-Decoding
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1430397

          Merged revision(s) 1430396 from lucene/dev/trunk:
          SOLR-4283: Improvements for URL-Decoding

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1430397 Merged revision(s) 1430396 from lucene/dev/trunk: SOLR-4283 : Improvements for URL-Decoding
          Hide
          Uwe Schindler added a comment -

          Committed to trunk and 4.x.

          A next step would be to make the encoding of the GET-URLs configureable (using the defacto standard "&ie=charset" URL parameter, as used by most REST webservices of major search engines).

          Show
          Uwe Schindler added a comment - Committed to trunk and 4.x. A next step would be to make the encoding of the GET-URLs configureable (using the defacto standard "&ie=charset" URL parameter, as used by most REST webservices of major search engines).

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development