Solr
  1. Solr
  2. SOLR-4265

Fix decoding of GET/POST parameters for servlet containers with non-UTF-8 URL parsing (Tomcat)

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: web gui
    • Labels:
      None
    • Environment:

      Windows but, environment independent

      Description

      When you type an accent (in french language for example) in the console query tester, there's no charset conversion (servlet request charset conversion)
      Eg.: "même" is converted into it's ISO-8859-1 representation ==> fail

      The reason : getCharacterEncoding from HTTPRequest is not tested. Il it's null, il will assume to convert an UTF-8 encoding charset.

      1. CropperCapture[4].png
        41 kB
        Dawid Weiss
      2. CropperCapture[5].png
        33 kB
        Dawid Weiss
      3. CropperCapture[6].png
        32 kB
        Dawid Weiss
      4. SOLR-4265.patch
        44 kB
        Uwe Schindler
      5. SOLR-4265.patch
        33 kB
        Uwe Schindler
      6. SOLR-4265.patch
        32 kB
        Uwe Schindler
      7. SOLR-4265.patch
        32 kB
        Uwe Schindler
      8. SOLR-4265.patch
        18 kB
        Uwe Schindler
      9. SOLR-4265.patch
        16 kB
        Uwe Schindler
      10. SolrDispatchFilter.java.patch
        0.6 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Alex Rocher added a comment -

          For me the patch is simple: 2 sources have been modified.

          org.apache.solr.common.params
          public MultiMapSolrParams(Map<String,String[]> map,String charset) {
          this.map =map;
          // If not null, assume that the job is ever done.
          if (charset ==null) {
          // Update values of parameter in regard to charset encoding.
          // By default, if charset encoding is not known we assume
          // that it's UTF-8 encoded from HTTP client (mainly browser).
          // Question : this rule is true in Servlet params, the code may no be here ?
          for (Entry<String, String[]> entry : map.entrySet()) {
          String[] values =entry.getValue();
          if (values !=null) {
          for (int i = 0; i < values.length; i++) {
          try

          { // Charset.XXX is better for performances. values[i] =new String(values[i].getBytes("ISO-8859-1"), "UTF-8"); }

          catch (UnsupportedEncodingException e)

          { // FIXME : what to do in this case ? Should never occurs ? }

          }
          // Refresh entry : remark, cannot create a new Map for unknown reason, have to update the current one.
          entry.setValue(values);
          }
          }
          }
          }

          and

          /**
          *
          */
          public class ServletSolrParams extends MultiMapSolrParams {
          public ServletSolrParams(ServletRequest req)

          { super(req.getParameterMap(), req.getCharacterEncoding()); }

          I think that my patch should only impact HTTP layer.

          Show
          Alex Rocher added a comment - For me the patch is simple: 2 sources have been modified. org.apache.solr.common.params public MultiMapSolrParams(Map<String,String[]> map,String charset) { this.map =map; // If not null, assume that the job is ever done. if (charset ==null) { // Update values of parameter in regard to charset encoding. // By default, if charset encoding is not known we assume // that it's UTF-8 encoded from HTTP client (mainly browser). // Question : this rule is true in Servlet params, the code may no be here ? for (Entry<String, String[]> entry : map.entrySet()) { String[] values =entry.getValue(); if (values !=null) { for (int i = 0; i < values.length; i++) { try { // Charset.XXX is better for performances. values[i] =new String(values[i].getBytes("ISO-8859-1"), "UTF-8"); } catch (UnsupportedEncodingException e) { // FIXME : what to do in this case ? Should never occurs ? } } // Refresh entry : remark, cannot create a new Map for unknown reason, have to update the current one. entry.setValue(values); } } } } and /** * */ public class ServletSolrParams extends MultiMapSolrParams { public ServletSolrParams(ServletRequest req) { super(req.getParameterMap(), req.getCharacterEncoding()); } I think that my patch should only impact HTTP layer.
          Hide
          Dawid Weiss added a comment -

          This isn't the right approach. The double conversion from iso8859-1 to UTF-8 is simply wrong – you assume identity conversion to bytes but in fact the characters may be screwed up already by that time.

          This is a container configuration issue. Jetty (or Tomcat, or whatever else) should be configured to decode servlet parameters using UTF-8. For Tomcat the instructions are here:

          http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8

          If you find a similar configuration setting for Jetty it should be applied for the default distribution.

          Show
          Dawid Weiss added a comment - This isn't the right approach. The double conversion from iso8859-1 to UTF-8 is simply wrong – you assume identity conversion to bytes but in fact the characters may be screwed up already by that time. This is a container configuration issue. Jetty (or Tomcat, or whatever else) should be configured to decode servlet parameters using UTF-8. For Tomcat the instructions are here: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 If you find a similar configuration setting for Jetty it should be applied for the default distribution.
          Hide
          Uwe Schindler added a comment -

          Hi,
          I never understood why you have to set the parameter encoding in your servlet container. Although this is somehow "undocumented", you can use ServletResponse.setCharacterEncoding() to set the encoding of your serlet input. Unfortunately this sets also the encoding for the request body, but thats irrelevant for us, because we use InputStream only.

          The attached patch simply does this on SolrDispatchFilter's startup of GET method (important: before any parameter is read).

          I am just not sure if this works with ServletFilters, but with conventional Servlets this works 100% and used in all my servlets I ever wrote. I never need to configure my Jetty or Sun Java System Webserver - no test on Tomcat.

          Can somebody try this and report back?

          Show
          Uwe Schindler added a comment - Hi, I never understood why you have to set the parameter encoding in your servlet container. Although this is somehow "undocumented", you can use ServletResponse.setCharacterEncoding() to set the encoding of your serlet input. Unfortunately this sets also the encoding for the request body, but thats irrelevant for us, because we use InputStream only. The attached patch simply does this on SolrDispatchFilter's startup of GET method (important: before any parameter is read ). I am just not sure if this works with ServletFilters, but with conventional Servlets this works 100% and used in all my servlets I ever wrote. I never need to configure my Jetty or Sun Java System Webserver - no test on Tomcat. Can somebody try this and report back?
          Hide
          Dawid Weiss added a comment -

          I am just not sure if this works with ServletFilters, but with conventional Servlets this works 100% and used in all my servlets I ever wrote.

          I can guarantee you that this won't always work (there were/are servlet containers that parse parameters eagerly in which case that call is ignored). You can double check in the J2EE spec if you want but I'm pretty sure there are no guarantees about when URI parsing/ parameter decoding takes place.

          Show
          Dawid Weiss added a comment - I am just not sure if this works with ServletFilters, but with conventional Servlets this works 100% and used in all my servlets I ever wrote. I can guarantee you that this won't always work (there were/are servlet containers that parse parameters eagerly in which case that call is ignored). You can double check in the J2EE spec if you want but I'm pretty sure there are no guarantees about when URI parsing/ parameter decoding takes place.
          Hide
          Dawid Weiss added a comment -

          https://issues.apache.org/bugzilla/show_bug.cgi?id=23929

          This thread has some background info on the issue. I don't want to go into the debate on what's the proper behavior; like I said though – I'm confident it's not a reliable method to enforce proper encoding of query parameters (passed via GET/ URI string).

          Show
          Dawid Weiss added a comment - https://issues.apache.org/bugzilla/show_bug.cgi?id=23929 This thread has some background info on the issue. I don't want to go into the debate on what's the proper behavior; like I said though – I'm confident it's not a reliable method to enforce proper encoding of query parameters (passed via GET/ URI string).
          Hide
          Dawid Weiss added a comment -

          Also, see this one:
          http://download.oracle.com/otn-pub/jcp/servlet-3.0-mrel-full-oth-JSpec/servlet-3_0-mrel-spec.pdf?AuthParam=1357328692_53ed082343de48588685d44e37c09e7f

          section 3.10 talks about setCharacterEncoding. This is for forcing HTTP body decoding (for form-encoded POSTs), not URI/headers.

          Show
          Dawid Weiss added a comment - Also, see this one: http://download.oracle.com/otn-pub/jcp/servlet-3.0-mrel-full-oth-JSpec/servlet-3_0-mrel-spec.pdf?AuthParam=1357328692_53ed082343de48588685d44e37c09e7f section 3.10 talks about setCharacterEncoding. This is for forcing HTTP body decoding (for form-encoded POSTs), not URI/headers.
          Hide
          Uwe Schindler added a comment -

          Dawid: I know this problem, but for some servlet containers this brings the correct effect (at least for jetty and SJSWS/iPlanet). Another idea is to let the request parameters be and write an own decoder for HttpServletRequest.getQueryString(). Then you can enforce encoding as you like. Thats what I also did in the past to make it portable around servlet containers.
          A decoder for form-encoded query-strings is very simple.

          Show
          Uwe Schindler added a comment - Dawid: I know this problem, but for some servlet containers this brings the correct effect (at least for jetty and SJSWS/iPlanet). Another idea is to let the request parameters be and write an own decoder for HttpServletRequest.getQueryString(). Then you can enforce encoding as you like. Thats what I also did in the past to make it portable around servlet containers. A decoder for form-encoded query-strings is very simple.
          Hide
          Uwe Schindler added a comment -

          Of course after doing this, you know what to add to forbidden APIS?

          Show
          Uwe Schindler added a comment - Of course after doing this, you know what to add to forbidden APIS?
          Hide
          Alex Rocher added a comment -

          OKA guy's, I'm pretty sure on my facts : when you want to use HTTP protocol trough Servlets, the character encoding (when null !) should have a default value.
          My proposal is for HTTP protocol, and nothing else.

          You can test my testcase, there's really a problem.

          ISO-8859-1 is the major test case but for me, at this time, sol'y is not UTF-8 compliant.

          Show
          Alex Rocher added a comment - OKA guy's, I'm pretty sure on my facts : when you want to use HTTP protocol trough Servlets, the character encoding (when null !) should have a default value. My proposal is for HTTP protocol, and nothing else. You can test my testcase, there's really a problem. ISO-8859-1 is the major test case but for me, at this time, sol'y is not UTF-8 compliant.
          Hide
          Uwe Schindler added a comment - - edited

          Alex: Solr expects all URL parameters encoded as UTF-8 - PERIOD. The problem we are discussing about here is that some servlet containers use ISO-8859-1 to decode the parameters, so although you pass UTF-8-URL-encoded values (e.g. your example would be "q=m%C3%AAme") the servlet container may not use UTF-8 to decode the %-encoded parts. This causes the issue you have seen. And this is currently a configuration issue (in Tomcat you have to change connector), in Jetty you have to set the body encoding (sorry, Dawid, in Jetty this works definitely).

          The HTTP protocol by itsself has nothing to do with this. The whole issue is about the request URI and the decoding of the URL parameters (URLDecorder java class).

          My proposal to fix this in a portable way (like we did with the InputStreams/OutputStreams instead of using Readers/Writers to prevent the buggy Jetty Readers/Writers)): For POST requests, let us set the body encoding (as demonstrated in the patch) to UTF-8. And for the GET-parameters lets decode them manually. Its just a series of String.split() and URLDecoder.decode(..., "UTF-8")

          Show
          Uwe Schindler added a comment - - edited Alex: Solr expects all URL parameters encoded as UTF-8 - PERIOD. The problem we are discussing about here is that some servlet containers use ISO-8859-1 to decode the parameters, so although you pass UTF-8-URL-encoded values (e.g. your example would be "q=m%C3%AAme") the servlet container may not use UTF-8 to decode the %-encoded parts. This causes the issue you have seen. And this is currently a configuration issue (in Tomcat you have to change connector), in Jetty you have to set the body encoding (sorry, Dawid, in Jetty this works definitely). The HTTP protocol by itsself has nothing to do with this. The whole issue is about the request URI and the decoding of the URL parameters (URLDecorder java class). My proposal to fix this in a portable way (like we did with the InputStreams/OutputStreams instead of using Readers/Writers to prevent the buggy Jetty Readers/Writers)): For POST requests, let us set the body encoding (as demonstrated in the patch) to UTF-8. And for the GET-parameters lets decode them manually. Its just a series of String.split() and URLDecoder.decode(..., "UTF-8")
          Hide
          Dawid Weiss added a comment -

          I've checked on Tomcat and setting request encoding doesn't work (you need to specify either useBodyEncodingForURI or an explicit encoding in the connector config).

          Anyway, I like Uwe's idea – parse the query string manually. This could be split to only cover GETs, setCharacterEncoding will be fine for POSTs.... or just handle both since POSTs with form-encoding are pretty much identical to GETs only placed in the request body.

          Alex: this has nothing to do with HTTP, it's how servlet specification is formulated. HTTP encoding says one thing (character encoding should be sent from the browser along with forms, for example) but browsers do another (they usually don't send anything). The default encoding for decoding URIs in the servlet spec. is US-ASCII I believe (a subset of ISO-8859-1); you'd need to double check though, I may be wrong here.

          Show
          Dawid Weiss added a comment - I've checked on Tomcat and setting request encoding doesn't work (you need to specify either useBodyEncodingForURI or an explicit encoding in the connector config). Anyway, I like Uwe's idea – parse the query string manually. This could be split to only cover GETs, setCharacterEncoding will be fine for POSTs.... or just handle both since POSTs with form-encoding are pretty much identical to GETs only placed in the request body. Alex: this has nothing to do with HTTP, it's how servlet specification is formulated. HTTP encoding says one thing (character encoding should be sent from the browser along with forms, for example) but browsers do another (they usually don't send anything). The default encoding for decoding URIs in the servlet spec. is US-ASCII I believe (a subset of ISO-8859-1); you'd need to double check though, I may be wrong here.
          Hide
          Dawid Weiss added a comment -

          (sorry, Dawid, in Jetty this works definitely).

          This isn't clearly explained in the spec and hence the confusion I guess. The setCharacterEncoding method's documentation is pretty clear that it applies to POSTs only, not URIs. But any open interpretation area like this will be prone to different implementations. It's a pity they just didn't include URI character encoding scheme as part of web application's deployment descriptor – then you'd be in control of what you want it to be.

          Show
          Dawid Weiss added a comment - (sorry, Dawid, in Jetty this works definitely). This isn't clearly explained in the spec and hence the confusion I guess. The setCharacterEncoding method's documentation is pretty clear that it applies to POSTs only, not URIs. But any open interpretation area like this will be prone to different implementations. It's a pity they just didn't include URI character encoding scheme as part of web application's deployment descriptor – then you'd be in control of what you want it to be.
          Hide
          Alex Rocher added a comment -

          Funny subject I Think ... finally

          Für mich, the problem is quite simple : when you retrieve a "null" from a Servlet component for request.getCharacterEncoding(), you'll to suppose that you're HTTP client send in a chartset. By default, we can imagine that it's UTF-8.
          I've never found any Servlet/rule/http/browser on this this subjetc. My experience telle me : no encoding, please encode me from UTF-8 :-/

          PS: sorry for my english, I'm french and as you we have a pity translation education

          Show
          Alex Rocher added a comment - Funny subject I Think ... finally Für mich, the problem is quite simple : when you retrieve a "null" from a Servlet component for request.getCharacterEncoding(), you'll to suppose that you're HTTP client send in a chartset. By default, we can imagine that it's UTF-8. I've never found any Servlet/rule/http/browser on this this subjetc. My experience telle me : no encoding, please encode me from UTF-8 :-/ PS: sorry for my english, I'm french and as you we have a pity translation education
          Hide
          Alex Rocher added a comment -

          In all case, there's is a problem. I've tested through 2 JEE container and the problematic is the same (Oracle + Tomcat).

          Thanks for your responses

          Show
          Alex Rocher added a comment - In all case, there's is a problem. I've tested through 2 JEE container and the problematic is the same (Oracle + Tomcat). Thanks for your responses
          Hide
          Yonik Seeley added a comment -

          When you type an accent (in french language for example) in the console query tester, there's no charset conversion (servlet request charset conversion)
          Eg.: "même" is converted into it's ISO-8859-1 representation ==> fail

          I thought we had all this UTF8 stuff nailed long ago... I just tried it out on the new admin console and it seems to be working fine (although I just tried Chrome on OS-X).

          I assume what you meant by "console query tester" is
          http://localhost:8983/solr/#/collection1/query

          I tried cutting and pasting in héllo (hello with an accent over the e), and it worked fine. As did même. Can you provide more detailed instructions on how to reproduce what you are seeing?

          Show
          Yonik Seeley added a comment - When you type an accent (in french language for example) in the console query tester, there's no charset conversion (servlet request charset conversion) Eg.: "même" is converted into it's ISO-8859-1 representation ==> fail I thought we had all this UTF8 stuff nailed long ago... I just tried it out on the new admin console and it seems to be working fine (although I just tried Chrome on OS-X). I assume what you meant by "console query tester" is http://localhost:8983/solr/#/collection1/query I tried cutting and pasting in héllo (hello with an accent over the e), and it worked fine. As did même. Can you provide more detailed instructions on how to reproduce what you are seeing?
          Hide
          Yonik Seeley added a comment - - edited

          In all case, there's is a problem. I've tested through 2 JEE container and the problematic is the same (Oracle + Tomcat).

          Ah, there ya go. That explains some things. Tomcat at least doesn't do UTF8 encoding by default in the URI (as is required by the standards).

          Can you reproduce your bug with the stock jetty server included in the Solr distribution?

          Show
          Yonik Seeley added a comment - - edited In all case, there's is a problem. I've tested through 2 JEE container and the problematic is the same (Oracle + Tomcat). Ah, there ya go. That explains some things. Tomcat at least doesn't do UTF8 encoding by default in the URI (as is required by the standards). Can you reproduce your bug with the stock jetty server included in the Solr distribution?
          Hide
          Yonik Seeley added a comment -

          The default encoding for decoding URIs in the servlet spec. is US-ASCII I believe (a subset of ISO-8859-1); you'd need to double check though, I may be wrong here.

          Yeah, it's actually "double encoded"... it's UTF8 encoded and then percent encoded, which maps any characters outside of the US-ASCII range into percent encoded UTF-8 bytes.
          http://www.ietf.org/rfc/rfc3986.txt

          Show
          Yonik Seeley added a comment - The default encoding for decoding URIs in the servlet spec. is US-ASCII I believe (a subset of ISO-8859-1); you'd need to double check though, I may be wrong here. Yeah, it's actually "double encoded"... it's UTF8 encoded and then percent encoded, which maps any characters outside of the US-ASCII range into percent encoded UTF-8 bytes. http://www.ietf.org/rfc/rfc3986.txt
          Hide
          Uwe Schindler added a comment -

          Yonik,
          this is a similar problem like the broken Reader/Writer implementations we had in Jetty, that corrupted UTF-8. We no longer user ServletRequest.getReader() nor ServletResponse.getWriter() as the definition waht they do is too vague and the implementations in e.g. Jetty (but also Tomcat) are broken. The solution to this is to ignore this addon-functionality in the servlet container and use the InputStream and OutputStream only (this is what I did in an earlier issue).
          We should do the same here to get invariant from configuration directives outside of Solr's influence. The solution to this is to get the query string and decode it (undo &, =, percent encoding) with a URLDecoder configured to use UTF-8. I am currently looking into doing this. When we fixed this, SolrDispatchFilter and the other servlets is container invariant.

          Show
          Uwe Schindler added a comment - Yonik, this is a similar problem like the broken Reader/Writer implementations we had in Jetty, that corrupted UTF-8. We no longer user ServletRequest.getReader() nor ServletResponse.getWriter() as the definition waht they do is too vague and the implementations in e.g. Jetty (but also Tomcat) are broken. The solution to this is to ignore this addon-functionality in the servlet container and use the InputStream and OutputStream only (this is what I did in an earlier issue). We should do the same here to get invariant from configuration directives outside of Solr's influence. The solution to this is to get the query string and decode it (undo &, =, percent encoding) with a URLDecoder configured to use UTF-8. I am currently looking into doing this. When we fixed this, SolrDispatchFilter and the other servlets is container invariant.
          Hide
          Yonik Seeley added a comment -

          The solution to this is to get the query string and decode it (undo &, =, percent encoding)

          Makes sense, I just wasn't sure there was always a way to get the raw string before the container tried to do some decoding for you.

          IIRC, we added some code a long time ago to decode request parameters on the URI in certain circumstances.
          See SolrRequestParsers.parseQueryString

          Show
          Yonik Seeley added a comment - The solution to this is to get the query string and decode it (undo &, =, percent encoding) Makes sense, I just wasn't sure there was always a way to get the raw string before the container tried to do some decoding for you. IIRC, we added some code a long time ago to decode request parameters on the URI in certain circumstances. See SolrRequestParsers.parseQueryString
          Hide
          Uwe Schindler added a comment -

          Makes sense, I just wasn't sure there was always a way to get the raw string before the container tried to do some decoding for you.

          getQueryString() does not do any decoding, the part after the ? in the request URL is returned unmodified. And this one contains only ASCII, if the URL is correct, because the HTTP protocol only knows ASCII in the protocol part (warning: this differes from URI definition, but URI != URL).

          See SolrRequestParsers.parseQueryString

          Great, we should use this also for the request parameters in getQueryString()! Then you can put the Solr WAR file into Tomcat and it would just work(tm) without configuration issues.

          Show
          Uwe Schindler added a comment - Makes sense, I just wasn't sure there was always a way to get the raw string before the container tried to do some decoding for you. getQueryString() does not do any decoding, the part after the ? in the request URL is returned unmodified. And this one contains only ASCII, if the URL is correct, because the HTTP protocol only knows ASCII in the protocol part (warning: this differes from URI definition, but URI != URL). See SolrRequestParsers.parseQueryString Great, we should use this also for the request parameters in getQueryString()! Then you can put the Solr WAR file into Tomcat and it would just work(tm) without configuration issues.
          Hide
          Yonik Seeley added a comment -

          getQueryString() does not do any decoding, the part after the ? in the request URL is returned unmodified. And this one contains only ASCII, if the URL is correct, because the HTTP protocol only knows ASCII

          It's also nice to be able to handle bare UTF-8 (i.e. not necessarily percent encoded if outside of the ASCII range). I've run across one browser that does this, and curl itself - for example:
          curl 'http://localhost:8983/solr/query?q=héllo'

          So if the container is still decoding the bytes into a String before we get it (and assumes latin-1), we still can't fix that part. It would still an improvement to be able to handle percent encoded UTF-8 by default though, and the bare-UTF8 should still work for containers configured correctly (like Jetty by default).

          Show
          Yonik Seeley added a comment - getQueryString() does not do any decoding, the part after the ? in the request URL is returned unmodified. And this one contains only ASCII, if the URL is correct, because the HTTP protocol only knows ASCII It's also nice to be able to handle bare UTF-8 (i.e. not necessarily percent encoded if outside of the ASCII range). I've run across one browser that does this, and curl itself - for example: curl 'http://localhost:8983/solr/query?q=héllo' So if the container is still decoding the bytes into a String before we get it (and assumes latin-1), we still can't fix that part. It would still an improvement to be able to handle percent encoded UTF-8 by default though, and the bare-UTF8 should still work for containers configured correctly (like Jetty by default).
          Hide
          Uwe Schindler added a comment -

          It's also nice to be able to handle bare UTF-8 (i.e. not necessarily percent encoded if outside of the ASCII range). I've run across one browser that does this, and curl itself - for example:

          curl 'http://localhost:8983/solr/query?q=héllo'

          That's an invalid HTTP request. Browsers generally automatically transform this and URLEncode before sending to the server - look into the access.log.

          Show
          Uwe Schindler added a comment - It's also nice to be able to handle bare UTF-8 (i.e. not necessarily percent encoded if outside of the ASCII range). I've run across one browser that does this, and curl itself - for example: curl 'http://localhost:8983/solr/query?q=héllo' That's an invalid HTTP request. Browsers generally automatically transform this and URLEncode before sending to the server - look into the access.log.
          Hide
          Uwe Schindler added a comment -

          I started here and added a new forbidden API file for Solr:

          #  Licensed to the Apache Software Foundation (ASF) under one or more
          #  contributor license agreements.  See the NOTICE file distributed with
          #  this work for additional information regarding copyright ownership.
          #  The ASF licenses this file to You under the Apache License, Version 2.0
          #  (the "License"); you may not use this file except in compliance with
          #  the License.  You may obtain a copy of the License at
          #
          #       http://www.apache.org/licenses/LICENSE-2.0
          #
          #  Unless required by applicable law or agreed to in writing, software
          #  distributed under the License is distributed on an "AS IS" BASIS,
          #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          #  See the License for the specific language governing permissions and
          #  limitations under the License.
          
          # These methods from the Servlet API should not be used, because they are
          # either broken and slow in some environments (e.g., Jetty's UTF-8 readers),
          # or the parsing of request parameters is not using the correct encoding
          # without extra configuration in the servlet container:
          javax.servlet.ServletRequest#getReader()
          javax.servlet.ServletRequest#getParameter(java.lang.String) 
          javax.servlet.ServletRequest#getParameterMap() 
          javax.servlet.ServletRequest#getParameterNames() 
          javax.servlet.ServletRequest#getParameterValues(java.lang.String) 
          javax.servlet.ServletResponse#getWriter()
          

          It looks actuall quite good. Only SolrRequestParser and SolrParsm uses the parameter methods, but we can remove those methods. Interestingly SolrDispatchFilter only uses the Parameter maps for GET request, POST is handled automatically in a correct way (also the URL parameters). So the fix is easy: Just use the method from SolrRequestParsers also for GET and not only for post and pass the result to MultiParams.

          The remaining issues are the use of getReader() and getWriter() in Zookeeper servlets. We should fix that, too. getReader/getWriter is slow and broken for supplementary chars in Jetty and possibly other servers. We should always use InputStreams/OutputStreams and wrap the reader ourselves (like we do in DispatchFilter).

          Show
          Uwe Schindler added a comment - I started here and added a new forbidden API file for Solr: # Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for additional information regarding copyright ownership. # The ASF licenses this file to You under the Apache License, Version 2.0 # (the "License"); you may not use this file except in compliance with # the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # These methods from the Servlet API should not be used, because they are # either broken and slow in some environments (e.g., Jetty's UTF-8 readers), # or the parsing of request parameters is not using the correct encoding # without extra configuration in the servlet container: javax.servlet.ServletRequest#getReader() javax.servlet.ServletRequest#getParameter(java.lang.String) javax.servlet.ServletRequest#getParameterMap() javax.servlet.ServletRequest#getParameterNames() javax.servlet.ServletRequest#getParameterValues(java.lang.String) javax.servlet.ServletResponse#getWriter() It looks actuall quite good. Only SolrRequestParser and SolrParsm uses the parameter methods, but we can remove those methods. Interestingly SolrDispatchFilter only uses the Parameter maps for GET request, POST is handled automatically in a correct way (also the URL parameters). So the fix is easy: Just use the method from SolrRequestParsers also for GET and not only for post and pass the result to MultiParams. The remaining issues are the use of getReader() and getWriter() in Zookeeper servlets. We should fix that, too. getReader/getWriter is slow and broken for supplementary chars in Jetty and possibly other servers. We should always use InputStreams/OutputStreams and wrap the reader ourselves (like we do in DispatchFilter).
          Hide
          Yonik Seeley added a comment -

          That's an invalid HTTP request.

          Yes, I know... but it's still nice to support it when possible.

          Show
          Yonik Seeley added a comment - That's an invalid HTTP request. Yes, I know... but it's still nice to support it when possible.
          Hide
          Uwe Schindler added a comment -

          Yes, I know... but it's still nice to support it when possible.

          We have no chance to do this, unless we add our own HTTP layer, e.g. with Netty. Tomcat and Jetty are only required to parse HTTP headers and the request line with US-ASCII.

          But we can fix the correct URLs by disallowing the forbidden APIs above. In addition I only have to fix ZookeeperInfoServlet, the remaining stuff is fine.

          Show
          Uwe Schindler added a comment - Yes, I know... but it's still nice to support it when possible. We have no chance to do this, unless we add our own HTTP layer, e.g. with Netty. Tomcat and Jetty are only required to parse HTTP headers and the request line with US-ASCII. But we can fix the correct URLs by disallowing the forbidden APIs above. In addition I only have to fix ZookeeperInfoServlet, the remaining stuff is fine.
          Hide
          Yonik Seeley added a comment -

          We have no chance to do this, unless we add our own HTTP layer, e.g. with Netty.

          Yeah, I know... that's what I was pointing out.
          Anyway, it works with pretty much all containers that I've ever tried once they are configured to accept percent-encoded UTF-8 URIS (and works with Jetty out of the box), so as long as we don't mess up what currently does work, I'm happy.

          Show
          Yonik Seeley added a comment - We have no chance to do this, unless we add our own HTTP layer, e.g. with Netty. Yeah, I know... that's what I was pointing out. Anyway, it works with pretty much all containers that I've ever tried once they are configured to accept percent-encoded UTF-8 URIS (and works with Jetty out of the box), so as long as we don't mess up what currently does work, I'm happy.
          Hide
          Yonik Seeley added a comment -

          For people looking to play around with this stuff, there is a helper script to test out a container:

          /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
          Solr server is up.
          HTTP GET is accepting UTF-8
          HTTP POST is accepting UTF-8
          HTTP POST defaults to UTF-8
          HTTP GET is accepting UTF-8 beyond the basic multilingual plane
          HTTP POST is accepting UTF-8 beyond the basic multilingual plane
          HTTP POST + URL params is accepting UTF-8 beyond the basic multilingual plane
          Response correctly returns UTF-8 beyond the basic multilingual plane
          
          Show
          Yonik Seeley added a comment - For people looking to play around with this stuff, there is a helper script to test out a container: /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh Solr server is up. HTTP GET is accepting UTF-8 HTTP POST is accepting UTF-8 HTTP POST defaults to UTF-8 HTTP GET is accepting UTF-8 beyond the basic multilingual plane HTTP POST is accepting UTF-8 beyond the basic multilingual plane HTTP POST + URL params is accepting UTF-8 beyond the basic multilingual plane Response correctly returns UTF-8 beyond the basic multilingual plane
          Hide
          Uwe Schindler added a comment -

          Patch:

          • adds several ServletRequest and ServletResponse methods to forbidden apis (getReader/getWriter/getParameter as those are unsafe with charsets)
          • Uses SolrRequestParsers to parse URL parameters and also form-encoded POST according to the standards. It also support form-encoded post data with different charsets than UTF-8 (but this charset is then also used for eventually given URL parameters)
          • Fixes the ZookeeperInfoServlet, too (not use unsafe Readers, Writers, getParameter)
          • The DebugServlet in the SolrJ tests is exclued from forbidden API

          All tests pass. I will do more tests with other servlet containers (Tomcat) and see if Solr works with UTF without any modifications.

          Show
          Uwe Schindler added a comment - Patch: adds several ServletRequest and ServletResponse methods to forbidden apis (getReader/getWriter/getParameter as those are unsafe with charsets) Uses SolrRequestParsers to parse URL parameters and also form-encoded POST according to the standards. It also support form-encoded post data with different charsets than UTF-8 (but this charset is then also used for eventually given URL parameters) Fixes the ZookeeperInfoServlet, too (not use unsafe Readers, Writers, getParameter) The DebugServlet in the SolrJ tests is exclued from forbidden API All tests pass. I will do more tests with other servlet containers (Tomcat) and see if Solr works with UTF without any modifications.
          Hide
          Uwe Schindler added a comment -

          Improvement: Use FastWriter instead of PrintWriter in the Zookeeper- and AdminServlet.

          Show
          Uwe Schindler added a comment - Improvement: Use FastWriter instead of PrintWriter in the Zookeeper- and AdminServlet.
          Hide
          Uwe Schindler added a comment -

          I quickly deployed solr.war from the unpatched and patched build to Tomcat 6.0.36 (which one is known to have problems). I only changed the port number in Tomcat to 8983 to make my test script working:

          • On Tomcat with unpatched Solr test_utf8.sh fails:
            bash-4.1$ ./test_utf8.sh
            Solr server is up.
            ERROR: HTTP GET is not accepting UTF-8
            HTTP POST is accepting UTF-8
            HTTP POST does not default to UTF-8
            ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
            HTTP POST is accepting UTF-8 beyond the basic multilingual plane
            ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane
            
          • With this patch all is fine:
            bash-4.1$ ./test_utf8.sh
            Solr server is up.
            HTTP GET is accepting UTF-8
            HTTP POST is accepting UTF-8
            HTTP POST defaults to UTF-8
            HTTP GET is accepting UTF-8 beyond the basic multilingual plane
            HTTP POST is accepting UTF-8 beyond the basic multilingual plane
            HTTP POST + URL params is accepting UTF-8 beyond the basic multilingual plane
            

          This is identical to Jetty. So we have a fix and no reconfiguration is needed in Tomcat anymore!

          Please note: The last test (ERROR: Response can't return UTF-8 beyond the basic multilingual plane) fails for me on cygwin with any servlet container, but passes on linux with UTF-8; but this is caused by the shell script which does not work on windows default locale (windows-1252).

          Show
          Uwe Schindler added a comment - I quickly deployed solr.war from the unpatched and patched build to Tomcat 6.0.36 (which one is known to have problems). I only changed the port number in Tomcat to 8983 to make my test script working: On Tomcat with unpatched Solr test_utf8.sh fails: bash-4.1$ ./test_utf8.sh Solr server is up. ERROR: HTTP GET is not accepting UTF-8 HTTP POST is accepting UTF-8 HTTP POST does not default to UTF-8 ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane HTTP POST is accepting UTF-8 beyond the basic multilingual plane ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane With this patch all is fine: bash-4.1$ ./test_utf8.sh Solr server is up. HTTP GET is accepting UTF-8 HTTP POST is accepting UTF-8 HTTP POST defaults to UTF-8 HTTP GET is accepting UTF-8 beyond the basic multilingual plane HTTP POST is accepting UTF-8 beyond the basic multilingual plane HTTP POST + URL params is accepting UTF-8 beyond the basic multilingual plane This is identical to Jetty. So we have a fix and no reconfiguration is needed in Tomcat anymore! Please note: The last test (ERROR: Response can't return UTF-8 beyond the basic multilingual plane) fails for me on cygwin with any servlet container, but passes on linux with UTF-8; but this is caused by the shell script which does not work on windows default locale (windows-1252).
          Hide
          Yonik Seeley added a comment -

          I did some manual testing, and one difference I notice is that on IE10 (Windows 8), pasting this into the address bar
          http://rogue.local:8983/solr/query?q=héllo
          Results in

          HTTP ERROR 500
          
          Problem accessing /solr/query. Reason: 
              {msg=Not valid UTF8! byte 6c in state 3,trace=org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte 6c in state 3
          	at org.eclipse.jetty.util.Utf8Appendable.appendByte(Utf8Appendable.java:174)
          	at org.eclipse.jetty.util.Utf8Appendable.append(Utf8Appendable.java:113)
          	at org.eclipse.jetty.http.HttpURI.toUtf8String(HttpURI.java:503)
          	at org.eclipse.jetty.http.HttpURI.getQuery(HttpURI.java:672)
          	at org.eclipse.jetty.server.Request.getQueryString(Request.java:835)
          	at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:395)
          	at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
          

          So it looks like IE10 doesn't percent encode the international character, but it wouldn't matter even if they did because it would be percent encoded latin-1 instead of UTF-8. It would probably work with Tomcat however (with or without this current patch).

          The old behavior did not result in an HTTP error, but I actually think this new behavior is preferable!
          Before this patch, the request was just incorrect and did not match the users intentions. At least now it will fail more quickly.

          Show
          Yonik Seeley added a comment - I did some manual testing, and one difference I notice is that on IE10 (Windows 8), pasting this into the address bar http://rogue.local:8983/solr/query?q=h éllo Results in HTTP ERROR 500 Problem accessing /solr/query. Reason: {msg=Not valid UTF8! byte 6c in state 3,trace=org.eclipse.jetty.util.Utf8Appendable$NotUtf8Exception: Not valid UTF8! byte 6c in state 3 at org.eclipse.jetty.util.Utf8Appendable.appendByte(Utf8Appendable.java:174) at org.eclipse.jetty.util.Utf8Appendable.append(Utf8Appendable.java:113) at org.eclipse.jetty.http.HttpURI.toUtf8String(HttpURI.java:503) at org.eclipse.jetty.http.HttpURI.getQuery(HttpURI.java:672) at org.eclipse.jetty.server.Request.getQueryString(Request.java:835) at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:395) at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115) So it looks like IE10 doesn't percent encode the international character, but it wouldn't matter even if they did because it would be percent encoded latin-1 instead of UTF-8. It would probably work with Tomcat however (with or without this current patch). The old behavior did not result in an HTTP error, but I actually think this new behavior is preferable! Before this patch, the request was just incorrect and did not match the users intentions. At least now it will fail more quickly.
          Hide
          Dawid Weiss added a comment -

          See Uwe, not all browsers respect the new URI spec I checked on IE9 and it's the same. More interestingly, it converts somewhat intelligently because "łódź" becomes "lódz" (ó is present in ISO8859-1, other accented characters and ł need to be mapped to their ASCII equivalents; this isn't a simple codepage conversion from Unicode).

          Show
          Dawid Weiss added a comment - See Uwe, not all browsers respect the new URI spec I checked on IE9 and it's the same. More interestingly, it converts somewhat intelligently because "łódź" becomes "lódz" (ó is present in ISO8859-1, other accented characters and ł need to be mapped to their ASCII equivalents; this isn't a simple codepage conversion from Unicode).
          Hide
          Uwe Schindler added a comment -

          Yonik: Internet Exploder is wrong you are right. What we enforce here is that all URLs coming in are in valid-encoded UTF-8, which is also documented in the official Solr docs.

          The problem with IE is the standard one: Microsoft's sense of backwards compatibility, but should be easy to fix (but for some reason did not work here): Internet Options -> Advanced (tab) -> International -> Send UTF-8-URLs

          The old behavior did not result in an HTTP error, but I actually think this new behavior is preferable!

          I would also prefer the new behaviour: The user knows that he is wrong and will not complain with horrible discussions making Solr the bad guy being wrong with unicode in the end. User gets clear message that his URL was wrong -> and the stack trace contains Jetty on the top!

          I will do some small improvements in the patch soon and upload a new one. There is one inconsistency currently: When you have a POST request (form-encoded) but also add ?-parameters into the URL, the reported form-encoded charset (the one the client sends in the POST Content-Type) will be used to decode everything, also the URL params. This should be done separately. I have to refactor some code to do this correct [currently StandardRequestParser appends both the POST content and the URL-Query and decodes in one turn].

          Show
          Uwe Schindler added a comment - Yonik: Internet Exploder is wrong you are right. What we enforce here is that all URLs coming in are in valid-encoded UTF-8, which is also documented in the official Solr docs. The problem with IE is the standard one: Microsoft's sense of backwards compatibility, but should be easy to fix (but for some reason did not work here): Internet Options -> Advanced (tab) -> International -> Send UTF-8-URLs The old behavior did not result in an HTTP error, but I actually think this new behavior is preferable! I would also prefer the new behaviour: The user knows that he is wrong and will not complain with horrible discussions making Solr the bad guy being wrong with unicode in the end. User gets clear message that his URL was wrong -> and the stack trace contains Jetty on the top! I will do some small improvements in the patch soon and upload a new one. There is one inconsistency currently: When you have a POST request (form-encoded) but also add ?-parameters into the URL, the reported form-encoded charset (the one the client sends in the POST Content-Type) will be used to decode everything, also the URL params. This should be done separately. I have to refactor some code to do this correct [currently StandardRequestParser appends both the POST content and the URL-Query and decodes in one turn] .
          Hide
          Dawid Weiss added a comment -

          Yeah, it's actually "double encoded"...

          I know. I wasn't talking about the HTTP layer, I was referring to the servlet spec where the encoding for query string parsing in URIs is not defined and form-encoded POSTs are to use ISO-8859-1 by default. So neither Jetty nor Tomcat are "right" in how they handle parameter parsing from URIs, it was just underspecified from the beginning.

          When you add the HTTP layer above things get even more confusing because, according to the spec, HTTP headers must be in US-ASCII. This includes the URI line so any character outside of US-ASCII (including your example) is a violation of the HTTP protocol. In practice (perhaps sadly) the rules and implementations are much more relaxed and just accept any bytes (until eols) (as per your curl example).

          In theory it should be quite simple: if the encoding cannot be determined otherwise (from the origin page for forms, etc.) then it should be UTF-8 encoded, then URL-escaped. In practice you get all the combinations of behaviors both on the browser/http client side and on the server side.

          Talking about this – I think I know what IE is trying to do. Since the URL you're typing in the browser is pretty much an indication where to send a GET request + the resource locator (URI) of the resource you're trying to access (which will end up in the HTTP header section), it's probably trying it's best to convert non-ASCII characters to their ASCII equivalents. So "ł" becomes "l" etc. If my explanation is anywhere near right then you need to admit it does make some sense... although it is absolutely useless and absurd in practical terms since any resource accessible via 'ł' in its path shouldn't be aliased to "l": (pl) "łał" -> (en) "whoa", "lał" -> ([he] was pissing)...

          Talk about standards, eh?

          Show
          Dawid Weiss added a comment - Yeah, it's actually "double encoded"... I know. I wasn't talking about the HTTP layer, I was referring to the servlet spec where the encoding for query string parsing in URIs is not defined and form-encoded POSTs are to use ISO-8859-1 by default. So neither Jetty nor Tomcat are "right" in how they handle parameter parsing from URIs, it was just underspecified from the beginning. When you add the HTTP layer above things get even more confusing because, according to the spec, HTTP headers must be in US-ASCII. This includes the URI line so any character outside of US-ASCII (including your example) is a violation of the HTTP protocol. In practice (perhaps sadly) the rules and implementations are much more relaxed and just accept any bytes (until eols) (as per your curl example). In theory it should be quite simple: if the encoding cannot be determined otherwise (from the origin page for forms, etc.) then it should be UTF-8 encoded, then URL-escaped. In practice you get all the combinations of behaviors both on the browser/http client side and on the server side. Talking about this – I think I know what IE is trying to do. Since the URL you're typing in the browser is pretty much an indication where to send a GET request + the resource locator (URI) of the resource you're trying to access (which will end up in the HTTP header section), it's probably trying it's best to convert non-ASCII characters to their ASCII equivalents. So "ł" becomes "l" etc. If my explanation is anywhere near right then you need to admit it does make some sense... although it is absolutely useless and absurd in practical terms since any resource accessible via 'ł' in its path shouldn't be aliased to "l": (pl) "łał" -> (en) "whoa", "lał" -> ( [he] was pissing)... Talk about standards, eh?
          Hide
          Dawid Weiss added a comment -

          Some fun with IE and non-ASCII parameters. I have a Polish Windows (Windows-1250) but even Polish codepage characters are normalized to ASCII. A more interesting example is Russian where some glyphs could be translated into ASCII but are not (and some are – see Москва).

          Show
          Dawid Weiss added a comment - Some fun with IE and non-ASCII parameters. I have a Polish Windows (Windows-1250) but even Polish codepage characters are normalized to ASCII. A more interesting example is Russian where some glyphs could be translated into ASCII but are not (and some are – see Москва).
          Hide
          Alex Rocher added a comment -

          On my experience, all the application servers I've used have the same behavior : Tomcat, JBoss, GlassFish, Oracle Application Server, WebLogic and WebSphere.
          If character encoding is known through the HTTP request, then the parameter Map is correctly converted in unicode. If not known, the conversion is done in 8859-1 and its the servlet (or other) responsability to reencode each parameter properly.
          From html pages, I think it's the meta tag charset that define the charset in which the URL parameters should be encoded.

          As mentionned in a previous post, browsers never send the character encoding in the http request (IEx, Firefox, Chrome, Opera). The only way to force the charset in request is to use XMLHttpRequest JS API.

          OK with you : a parameter should always be in US7ASCII range, much more simple

          Show
          Alex Rocher added a comment - On my experience, all the application servers I've used have the same behavior : Tomcat, JBoss, GlassFish, Oracle Application Server, WebLogic and WebSphere. If character encoding is known through the HTTP request, then the parameter Map is correctly converted in unicode. If not known, the conversion is done in 8859-1 and its the servlet (or other) responsability to reencode each parameter properly. From html pages, I think it's the meta tag charset that define the charset in which the URL parameters should be encoded. As mentionned in a previous post, browsers never send the character encoding in the http request (IEx, Firefox, Chrome, Opera). The only way to force the charset in request is to use XMLHttpRequest JS API. OK with you : a parameter should always be in US7ASCII range, much more simple
          Hide
          Uwe Schindler added a comment -

          Alex: We have already fixed the problem by not using the servlet API's parameter decoder. Solr was already using an own one for the query string, but npot yet used everywhere.

          The documentation in Solr says: All request parameters must be %-encoded with UTF-8, so we cannot depend on what the application server thinks is right. So we decode as UTF-8 as e.g. Jetty does. If it is not UTF-8 you get an error.

          Show
          Uwe Schindler added a comment - Alex: We have already fixed the problem by not using the servlet API's parameter decoder. Solr was already using an own one for the query string, but npot yet used everywhere. The documentation in Solr says: All request parameters must be %-encoded with UTF-8, so we cannot depend on what the application server thinks is right. So we decode as UTF-8 as e.g. Jetty does. If it is not UTF-8 you get an error.
          Hide
          Alex Rocher added a comment -

          I've just look at the .patch and I understand the way in which you made the patch
          Thank's for all.

          Show
          Alex Rocher added a comment - I've just look at the .patch and I understand the way in which you made the patch Thank's for all.
          Hide
          Uwe Schindler added a comment -

          Here the final patch with some improvements, security fixes and refactoring:

          • The parser for POSTed application/x-www-form-urlencoded is now a separate SolrRequestParser. This makes code more clean. StnadrdRequestParser just delegates
          • The POST form data now is also limited in size as given in SolrConfig (like multipart messages). Defaults to 2 MB. This was also done previously in Jetty, but not configureable through solrconfig. Now its at one place and consistent between all servlet containers.
          • when posting form data but also give parameters in URL, the URL form contents appear before the POST form data. This is also how it is done in Jetty/Tomcat. This only affects duplicate parameters. Also I fixed the previous code to use UTF-8 always for the URL form data, but for the POSTed one the content-type is visited.
          • Some cleanups in the other Solr servlets, also reenable POST is the ZookeeperInfoServlet (it now uses SolrRequestParsers - so all servlets are consistent in how they get input parameters)

          I want to commit this later this evening. I hope all is fine. If you have some time, you can check it.

          It might be good to backport this to 3.6.3 (if it comes out), as this is a serious issue for many people using Tomcat instead of Jetty and that don't read the Solr Wiki how to correctly configure their container.

          Show
          Uwe Schindler added a comment - Here the final patch with some improvements, security fixes and refactoring: The parser for POSTed application/x-www-form-urlencoded is now a separate SolrRequestParser. This makes code more clean. StnadrdRequestParser just delegates The POST form data now is also limited in size as given in SolrConfig (like multipart messages). Defaults to 2 MB. This was also done previously in Jetty, but not configureable through solrconfig. Now its at one place and consistent between all servlet containers. when posting form data but also give parameters in URL, the URL form contents appear before the POST form data. This is also how it is done in Jetty/Tomcat. This only affects duplicate parameters. Also I fixed the previous code to use UTF-8 always for the URL form data, but for the POSTed one the content-type is visited. Some cleanups in the other Solr servlets, also reenable POST is the ZookeeperInfoServlet (it now uses SolrRequestParsers - so all servlets are consistent in how they get input parameters) I want to commit this later this evening. I hope all is fine. If you have some time, you can check it. It might be good to backport this to 3.6.3 (if it comes out), as this is a serious issue for many people using Tomcat instead of Jetty and that don't read the Solr Wiki how to correctly configure their container.
          Hide
          Uwe Schindler added a comment -

          I am currently investigation to be more srict with parameter encoding: Currently it is not an error if the %-encoded terms are not valid UTF-8 (URLDecoder of JDK replaces all invalid chars with ?). To make this more strict and fail correctly (not silently doing the wrong thing), we could use commons-codec's (we already use that library) to do a binary URL decoding: http://commons.apache.org/codec/api-release/org/apache/commons/codec/net/URLCodec.html

          URLCodec.decodeUrl takes byte[] and returns byte[]. Using this method we have full flexibility on throwing encoding errors. We can in that case also pass the byte[] contents from POST stream directly! Should we do this or not? The current approach is greedy like webservers that also accept almost any wrong encoded %XX stuff.

          Show
          Uwe Schindler added a comment - I am currently investigation to be more srict with parameter encoding: Currently it is not an error if the %-encoded terms are not valid UTF-8 (URLDecoder of JDK replaces all invalid chars with ?). To make this more strict and fail correctly (not silently doing the wrong thing), we could use commons-codec's (we already use that library) to do a binary URL decoding: http://commons.apache.org/codec/api-release/org/apache/commons/codec/net/URLCodec.html URLCodec.decodeUrl takes byte[] and returns byte[]. Using this method we have full flexibility on throwing encoding errors. We can in that case also pass the byte[] contents from POST stream directly! Should we do this or not? The current approach is greedy like webservers that also accept almost any wrong encoded %XX stuff.
          Hide
          Uwe Schindler added a comment -

          Previous patch had an integer overflow in limitUploadMB. Fixed.

          Show
          Uwe Schindler added a comment - Previous patch had an integer overflow in limitUploadMB. Fixed.
          Hide
          Dawid Weiss added a comment -

          I'm fine with this patch if you want to commit it although I'd rather see out-of-the-box compliance against a typical browser behavior? Namely, if you have a POST form, with a target URL that contains (unescaped) unicode characters, the browser will use the originating page's encoding for both POST parameters and the URL. Say, something like:

              <form action="echo.jsp?abc=łódź" method="post" enctype="application/x-www-form-urlencoded">
                <input type="text" name="blah"><br>
                <input type="submit" value="Submit">
              </form>    
          

          Enforcing UTF-8 but respecting POST encoding for the body seems like we're creating custom rules which, given the mess we talked about above, should probably be avoided. The rules for me should be fairly simple:

          • try to get character encoding from HTTP header; if not present, assume UTF-8
          • decode the URI and the body (if POST) using the above encoding. If decoder failures occur, return HTTP BAD_REQUEST.

          The advantage here is that these rules won't surprise people that have simple HTML forms (for some reason with partial query string already hardcoded in the action attribute). If you apply your patch, you'll have to url-escape UTF-8 yourself (assuming the page's encoding is not UTF-8).

          Just a thought, I don't have a strong opinion about this.

          Show
          Dawid Weiss added a comment - I'm fine with this patch if you want to commit it although I'd rather see out-of-the-box compliance against a typical browser behavior? Namely, if you have a POST form, with a target URL that contains (unescaped) unicode characters, the browser will use the originating page's encoding for both POST parameters and the URL. Say, something like: <form action= "echo.jsp?abc=łódź" method= "post" enctype= "application/x-www-form-urlencoded" > <input type= "text" name= "blah" ><br> <input type= "submit" value= "Submit" > </form> Enforcing UTF-8 but respecting POST encoding for the body seems like we're creating custom rules which, given the mess we talked about above, should probably be avoided. The rules for me should be fairly simple: try to get character encoding from HTTP header; if not present, assume UTF-8 decode the URI and the body (if POST) using the above encoding. If decoder failures occur, return HTTP BAD_REQUEST. The advantage here is that these rules won't surprise people that have simple HTML forms (for some reason with partial query string already hardcoded in the action attribute). If you apply your patch, you'll have to url-escape UTF-8 yourself (assuming the page's encoding is not UTF-8). Just a thought, I don't have a strong opinion about this.
          Hide
          Uwe Schindler added a comment -

          Dawid: If the encoding for the URL would be different than UTF-8 we are again at the problems we had before. So everything that comes in as part of the URL has to be UTF-8 URLEncoded.

          The additional formdata in POST has to be handled like the standard wants: You have to respect the encoding given in the Content-Type. But you can be sure: No browser actually sends it, so it is also always defaulting to UTF-8. But if somebody sets the encoding, it must be rrespected.

          For URLS you have no chance, as the encoding cannot be submitted wih the HTTP request, unless HTTP/1.2 defines a new header that defines encoding for the URL

          The polish example above is broken alltogether, as the URL in action is not a valid URL. It also has the same problems as it is is not cross-browser supported in the same way. The user will get Jetty errors like discussed above (with IE) or otherwise broken behaviour. There is nothing we can do.

          • try to get character encoding from HTTP header; if not present, assume UTF-8
          • decode the URI and the body (if POST) using the above encoding. If decoder failures occur, return HTTP BAD_REQUEST.

          The body encoding says nothing about the URL encoding. Also you cannot give it for GET requests. So enforcing URLs to be UTF-8 is consistent.

          Uwe

          Show
          Uwe Schindler added a comment - Dawid: If the encoding for the URL would be different than UTF-8 we are again at the problems we had before. So everything that comes in as part of the URL has to be UTF-8 URLEncoded. The additional formdata in POST has to be handled like the standard wants: You have to respect the encoding given in the Content-Type. But you can be sure: No browser actually sends it, so it is also always defaulting to UTF-8. But if somebody sets the encoding, it must be rrespected. For URLS you have no chance, as the encoding cannot be submitted wih the HTTP request, unless HTTP/1.2 defines a new header that defines encoding for the URL The polish example above is broken alltogether, as the URL in action is not a valid URL. It also has the same problems as it is is not cross-browser supported in the same way. The user will get Jetty errors like discussed above (with IE) or otherwise broken behaviour. There is nothing we can do. try to get character encoding from HTTP header; if not present, assume UTF-8 decode the URI and the body (if POST) using the above encoding. If decoder failures occur, return HTTP BAD_REQUEST. The body encoding says nothing about the URL encoding. Also you cannot give it for GET requests. So enforcing URLs to be UTF-8 is consistent. Uwe
          Hide
          Uwe Schindler added a comment -

          More cleanup, make the maximum form data size separately configureable.

          Show
          Uwe Schindler added a comment - More cleanup, make the maximum form data size separately configureable.
          Hide
          Dawid Weiss added a comment -

          I disagree with you here. The example is fine because browsers will escape the action URI for you; they'll just use the codepage of the origin HTML, whatever it was. So it's a perfectly valid HTTP request and a perfectly valid query. The fact that browser don't send the information which encoding they used for doing so is of no relevance. You can try it, the form above would be sent (from an Windows1250 encoded Web page) as:

          $ nc -l 8081
          POST /echo.jsp?abc=%B3%F3d%9F HTTP/1.1
          Host: localhost:8081
          Connection: keep-alive
          Content-Length: 15
          

          The URL is invalid in the HTML source but browsers will "fix" it for you (url-escape) and many people accept it as something ordinary.

          The body encoding says nothing about the URL encoding.

          It does in the context of browsers. See my example above or try it yourself:

          <%@ page language="java" contentType="text/html; charset=Cp1250" pageEncoding="UTF-8"%>
          <html>
            <body>
              <form action="http://localhost:8081/echo.jsp?abc=łódź" method="post" enctype="application/x-www-form-urlencoded">
                <input type="text" name="blah"><br>
                <input type="submit" value="Submit">
              </form>    
            </body>
          </html>
          
          Show
          Dawid Weiss added a comment - I disagree with you here. The example is fine because browsers will escape the action URI for you; they'll just use the codepage of the origin HTML, whatever it was. So it's a perfectly valid HTTP request and a perfectly valid query. The fact that browser don't send the information which encoding they used for doing so is of no relevance. You can try it, the form above would be sent (from an Windows1250 encoded Web page) as: $ nc -l 8081 POST /echo.jsp?abc=%B3%F3d%9F HTTP/1.1 Host: localhost:8081 Connection: keep-alive Content-Length: 15 The URL is invalid in the HTML source but browsers will "fix" it for you (url-escape) and many people accept it as something ordinary. The body encoding says nothing about the URL encoding. It does in the context of browsers. See my example above or try it yourself: <%@ page language= "java" contentType= "text/html; charset=Cp1250" pageEncoding= "UTF-8" %> <html> <body> <form action= "http: //localhost:8081/echo.jsp?abc=łódź" method= "post" enctype= "application/x-www-form-urlencoded" > <input type= "text" name= "blah" ><br> <input type= "submit" value= "Submit" > </form> </body> </html>
          Hide
          Dawid Weiss added a comment -

          I still think it would have been more consistent to apply this to both the body and the URI (like Tomcat does if you request so in the config) but you can disregard me here. Maybe I'm paranoid.

          Show
          Dawid Weiss added a comment - I still think it would have been more consistent to apply this to both the body and the URI (like Tomcat does if you request so in the config) but you can disregard me here. Maybe I'm paranoid.
          Hide
          Uwe Schindler added a comment -

          The URL is invalid in the HTML source but browsers will "fix" it for you (url-escape) and many people accept it as something ordinary.

          Thats the whole problem, I agree. But we don't know what the browser used as encoding for encoding the URL. So without a charset given somewhere we cannot decode the URL. You have the problem with GET and POST requests. Because of this you cannot fix this for SOLR.

          The patch does not change any behaviour in guessing charsets from before, the only change here is the encoding used to decode URLs (which is now "UTF-8" because several web containers handle this in a different way). Jetty and Tomcat both handled POST content respecting the charset of the POST BODY - and that did not change.

          Where is your problem with Solr? The whole discussion could be flame-wared on the Jetty or Tomcat lists as before, unfortunately the HTTP spec and the Servlet spec and the URL spec are not precise enough. For Solr it is not an issue: Solr is documented to only accept URL encoded request parameters as UTF-8.

          The only way to change this would be to do it like search engine. They allow to pass in an "ie=" extra GET parameter that defines the "input encoding" of the URL parameters. In that case you could do a 2-step URL parsing approach (or use commons-codec: decode the binary url from the byte[] and then interpret the "ie" parameter as US-ASCII and use it to decode the remaining parameters.

          Show
          Uwe Schindler added a comment - The URL is invalid in the HTML source but browsers will "fix" it for you (url-escape) and many people accept it as something ordinary. Thats the whole problem, I agree. But we don't know what the browser used as encoding for encoding the URL. So without a charset given somewhere we cannot decode the URL. You have the problem with GET and POST requests. Because of this you cannot fix this for SOLR. The patch does not change any behaviour in guessing charsets from before, the only change here is the encoding used to decode URLs (which is now "UTF-8" because several web containers handle this in a different way). Jetty and Tomcat both handled POST content respecting the charset of the POST BODY - and that did not change. Where is your problem with Solr? The whole discussion could be flame-wared on the Jetty or Tomcat lists as before, unfortunately the HTTP spec and the Servlet spec and the URL spec are not precise enough. For Solr it is not an issue: Solr is documented to only accept URL encoded request parameters as UTF-8. The only way to change this would be to do it like search engine. They allow to pass in an "ie=" extra GET parameter that defines the "input encoding" of the URL parameters. In that case you could do a 2-step URL parsing approach (or use commons-codec: decode the binary url from the byte[] and then interpret the "ie" parameter as US-ASCII and use it to decode the remaining parameters.
          Hide
          Uwe Schindler added a comment - - edited

          I still think it would have been more consistent to apply this to both the body and the URI (like Tomcat does if you request so in the config) but you can disregard me here. Maybe I'm paranoid.

          Tomcat and Jetty default to a charset (according to the config). This charset is used in any case for the URL (it is always used for the URL!!!) - and it is used for the POST content if the charset is not given. If the Content-Type of the POST body is given, Jetty and Tomcat use the one given in content type.

          If you have a pure GET request its disallowed to set "Content-Type" so the charset is lost.

          Show
          Uwe Schindler added a comment - - edited I still think it would have been more consistent to apply this to both the body and the URI (like Tomcat does if you request so in the config) but you can disregard me here. Maybe I'm paranoid. Tomcat and Jetty default to a charset (according to the config). This charset is used in any case for the URL (it is always used for the URL!!!) - and it is used for the POST content if the charset is not given. If the Content-Type of the POST body is given, Jetty and Tomcat use the one given in content type. If you have a pure GET request its disallowed to set "Content-Type" so the charset is lost.
          Hide
          Uwe Schindler added a comment -

          Just to mention finally: No browser is ever setting a charset in the Content-Type header for "application/x-www-form-urlencoded", so we will always use UTF-8 (in fact its an "application/" mime type that requires no charset as it is strongly spoken binary data).

          The special case you are talking about is just that somebody explicitely sets a charset for the BODY (not the URL!) in a hand-made HTTP request. If somebody does this, he knows what he does. But in that case, the URL is still defined to be UTF-8 encoded (as the standard says and Solr requires). But as we have a encoding given in the conent type, we must parse the body using the given charset. There is no way around.

          Show
          Uwe Schindler added a comment - Just to mention finally: No browser is ever setting a charset in the Content-Type header for "application/x-www-form-urlencoded", so we will always use UTF-8 (in fact its an "application/" mime type that requires no charset as it is strongly spoken binary data). The special case you are talking about is just that somebody explicitely sets a charset for the BODY (not the URL!) in a hand-made HTTP request. If somebody does this, he knows what he does. But in that case, the URL is still defined to be UTF-8 encoded (as the standard says and Solr requires). But as we have a encoding given in the conent type, we must parse the body using the given charset. There is no way around.
          Hide
          Dawid Weiss added a comment -

          Tomcat and Jetty default to a charset (according to the config). This charset is used in any case for the URL (it is always used for the URL!!!) - and it is used for the POST content if the charset is not given.

          It's actually the other way around: you can force the charset for decoding URIs of POSTs and GETs that don't specify their own charset in HTTP headers using useBodyEncodingForURI (Tomcat).

          http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q2

          Anyway, it's fine, I think this patch takes it a lot further than it was anyway.

          Show
          Dawid Weiss added a comment - Tomcat and Jetty default to a charset (according to the config). This charset is used in any case for the URL (it is always used for the URL!!!) - and it is used for the POST content if the charset is not given. It's actually the other way around: you can force the charset for decoding URIs of POSTs and GETs that don't specify their own charset in HTTP headers using useBodyEncodingForURI (Tomcat). http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q2 Anyway, it's fine, I think this patch takes it a lot further than it was anyway.
          Hide
          Uwe Schindler added a comment -

          In Tomcat you can do this. Unfortunately, the default for Tomcat is to always use ISO-8859-1 as encoding for URL parameters and use the submitted body encoding (if available) for the form data. Jetty does the same, see http://grepcode.com/file/repo1.maven.org/maven2/org.eclipse.jetty/jetty-server/8.1.7.v20120910/org/eclipse/jetty/server/Request.java#Request.extractParameters%28%29.

          I will commit this patch later if nobody objects. We can open another issue to improve error handling on invalid encodings using commons-codec for URL decoding. This could also add the "ie=charset" request parameter (like Google and BING and others do), to specify the encoding of the URL request parameters.

          Show
          Uwe Schindler added a comment - In Tomcat you can do this. Unfortunately, the default for Tomcat is to always use ISO-8859-1 as encoding for URL parameters and use the submitted body encoding (if available) for the form data. Jetty does the same, see http://grepcode.com/file/repo1.maven.org/maven2/org.eclipse.jetty/jetty-server/8.1.7.v20120910/org/eclipse/jetty/server/Request.java#Request.extractParameters%28%29 . I will commit this patch later if nobody objects. We can open another issue to improve error handling on invalid encodings using commons-codec for URL decoding. This could also add the "ie=charset" request parameter (like Google and BING and others do), to specify the encoding of the URL request parameters.
          Hide
          Yonik Seeley added a comment -

          Jetty does the same

          In my experience, Jetty defaulted to UTF-8 for the URL encoding. AFAIK, we've never needed to tweak jetty charset config.

          Show
          Yonik Seeley added a comment - Jetty does the same In my experience, Jetty defaulted to UTF-8 for the URL encoding. AFAIK, we've never needed to tweak jetty charset config.
          Hide
          Uwe Schindler added a comment -

          In my experience, Jetty defaulted to UTF-8 for the URL encoding. AFAIK, we've never needed to tweak jetty charset config.

          Yes, thats right. Dawid and I were discussing about the behaviour regarding the encoding from POST content (if it overrides the URL encoding or not), although the used defaults were different in both web container. The difference with Jetty was simply that the encoding for URL parameters is UTF-8, see the above code from Request.java where it parses the query string.

          The final patch does exactly the same like jetty always did, only the default is fixed to be UTF-8 on any web container.

          Show
          Uwe Schindler added a comment - In my experience, Jetty defaulted to UTF-8 for the URL encoding. AFAIK, we've never needed to tweak jetty charset config. Yes, thats right. Dawid and I were discussing about the behaviour regarding the encoding from POST content (if it overrides the URL encoding or not), although the used defaults were different in both web container. The difference with Jetty was simply that the encoding for URL parameters is UTF-8, see the above code from Request.java where it parses the query string. The final patch does exactly the same like jetty always did, only the default is fixed to be UTF-8 on any web container.
          Hide
          Uwe Schindler added a comment -

          Final patch for trunk, now committing it.

          I added the new maximum formdata setting to example solrconfigs.

          I also added some checks in the formdata request parser, so it fails when you have changed the solr webapplication and added an incompatible Filter before SolrDispatchFilter that calls Request#getParameter*() and causes the servlet container to incorrectly jump in.

          Show
          Uwe Schindler added a comment - Final patch for trunk, now committing it. I added the new maximum formdata setting to example solrconfigs. I also added some checks in the formdata request parser, so it fails when you have changed the solr webapplication and added an incompatible Filter before SolrDispatchFilter that calls Request#getParameter*() and causes the servlet container to incorrectly jump in.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1429534

          SOLR-4265: Fix decoding of GET/POST parameters for servlet containers with non-UTF-8 URL parsing (Tomcat)

          Show
          Commit Tag Bot added a comment - [trunk commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1429534 SOLR-4265 : Fix decoding of GET/POST parameters for servlet containers with non-UTF-8 URL parsing (Tomcat)
          Hide
          Uwe Schindler added a comment -

          OK, I committed the new stuff. I will now check and update Solr Wiki.

          Show
          Uwe Schindler added a comment - OK, I committed the new stuff. I will now check and update Solr Wiki.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1429535

          Merged revision(s) 1429534 from lucene/dev/trunk:
          SOLR-4265: Fix decoding of GET/POST parameters for servlet containers with non-UTF-8 URL parsing (Tomcat)

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1429535 Merged revision(s) 1429534 from lucene/dev/trunk: SOLR-4265 : Fix decoding of GET/POST parameters for servlet containers with non-UTF-8 URL parsing (Tomcat)
          Hide
          Uwe Schindler added a comment -

          I fixed also the Solr Wiki pages of Tomcat, Glassfish,...

          Show
          Uwe Schindler added a comment - I fixed also the Solr Wiki pages of Tomcat, Glassfish,...
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1429822

          SOLR-4265: Remove useless setting in Jetty's server config file, as maximum form content is now handled by solr's RequestParser and not the web container.

          Show
          Commit Tag Bot added a comment - [trunk commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1429822 SOLR-4265 : Remove useless setting in Jetty's server config file, as maximum form content is now handled by solr's RequestParser and not the web container.
          Hide
          Uwe Schindler added a comment -

          I committed the removal of the Jetty-specifi example setting for maximum POST formdata size (as the webapp now handles this by its own code and own limits).

          Show
          Uwe Schindler added a comment - I committed the removal of the Jetty-specifi example setting for maximum POST formdata size (as the webapp now handles this by its own code and own limits).
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1429824

          Merged revision(s) 1429822 from lucene/dev/trunk:
          SOLR-4265: Remove useless setting in Jetty's server config file, as maximum form content is now handled by solr's RequestParser and not the web container.

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1429824 Merged revision(s) 1429822 from lucene/dev/trunk: SOLR-4265 : Remove useless setting in Jetty's server config file, as maximum form content is now handled by solr's RequestParser and not the web container.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1429958

          SOLR-4265: Fix charset bug...

          Show
          Commit Tag Bot added a comment - [trunk commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1429958 SOLR-4265 : Fix charset bug...
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1429959

          Merged revision(s) 1429958 from lucene/dev/trunk:
          SOLR-4265: Fix charset bug...

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1429959 Merged revision(s) 1429958 from lucene/dev/trunk: SOLR-4265 : Fix charset bug...

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Alex Rocher
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development