Solr
SOLR-443

POST queries don't declare their charset

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.3
    • Component/s: clients - java
    • Labels:
      None
    • Environment:

      Tomcat 6.0.14

      Description

      When sending a query via POST, the content type is not set. The content charset for the POST parameters is set, but this only appears to be used for creating the Content-Length header in the commons library. Since the query is encoded in UTF-8, the HTTP headers should also specify the content type charset.

      On Tomcat, this causes problems when the query string contains non-ASCII characters (characters with accents and such), as it tries to parse the POST body in its default ISO-8859-1 encoding. There appears to be no way to set/change the default encoding for a message body on Tomcat.
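      The mis-decoding described above can be reproduced in plain Java (a minimal standalone illustration, not Solr code): the UTF-8 bytes of an accented character, read back as ISO-8859-1, turn into mojibake.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // The client writes the query in UTF-8...
        byte[] utf8Bytes = "héllo".getBytes(StandardCharsets.UTF_8);

        // ...but a container defaulting to ISO-8859-1 decodes the same bytes as:
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(misread); // prints "hÃ©llo" -- the accented char is garbled
    }
}
```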

      1. solr-443.patch
        2 kB
        Ryan McKinley
      2. solr-443.patch
        1 kB
        Andrew Schurman
      3. SOLR-443-multipart.patch
        7 kB
        Lars Kotthoff
      4. SolrDispatchFilter.patch
        0.5 kB
        Hiroaki Kawai


          Activity

          Andrew Schurman added a comment -

          Simple fix that addresses the issue for this case. I don't believe it will cause issues elsewhere within the Java client.

          Ryan McKinley added a comment -

          Andrew, does this patch work for you?

          Rather than specifying the contentType for all POST requests, it only adds it for ones that don't specify it within a ContentStream.

          Andrew Schurman added a comment -

          Haven't had a chance to test that, but I believe that would work also, since we are only sending non-multipart POSTs anyway.

          Yonik Seeley added a comment -

          The problem is, the body isn't really in UTF8. Here's a request from SolrJ with the patch:

          POST /solr/select HTTP/1.1
          Content-Type: application/x-www-form-urlencoded; charset=UTF-8
          User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
          Host: localhost:8983
          Content-Length: 42
          
          q=features%3Ah%C3%A9llo&wt=xml&version=2.2
          

          The SolrJ code is

              SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
              ModifiableSolrParams params = new ModifiableSolrParams();
              QueryRequest req = new QueryRequest(params);
              params.set("q","features:h\u00E9llo");
              req.setMethod(SolrRequest.METHOD.POST);
              QueryResponse rsp = req.process(server);
          

          What HttpClient is outputting is percent-encoded UTF-8 bytes (and that's not UTF-8). So the charset here really isn't the problem, because the body is nothing but ASCII. The body coding matches the type of coding specified in the URI RFC http://www.ietf.org/rfc/rfc3986.txt
          But that only specifies the coding for parameters that go in the URI.
          I haven't been able to find an updated standard that specifies percent encoded UTF-8 bytes for application/x-www-form-urlencoded. Does anyone know if there is one?

          Anyway, long story short is that this may still fail on Tomcat.
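          Yonik's point, that the body itself is pure ASCII after percent-encoding, can be checked with the JDK's URLEncoder (a standalone sketch, not the actual SolrJ/HttpClient code path, though the output matches the captured request body above):

```java
import java.net.URLEncoder;

public class PercentEncodingDemo {
    public static void main(String[] args) throws Exception {
        // Percent-encode the query parameter as a form body would carry it:
        String encoded = URLEncoder.encode("features:héllo", "UTF-8");

        // The result contains only ASCII characters; the UTF-8 bytes of 'é'
        // (0xC3 0xA9) appear as the escape sequences %C3%A9.
        System.out.println(encoded); // features%3Ah%C3%A9llo
    }
}
```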

          Andrew Schurman added a comment -

          I believe you're right, Yonik. I think when I was testing I forgot to remove a filter that I was using to convert the request into UTF-8. I'm now testing again and it still appears to process the results inconsistently.

          Andrew Schurman added a comment -

          Hmm... I just tested the latest patch on a different machine with Tomcat 6.0.14 and it does appear to work (I must have some sort of caching problem on my other machine).

          As for standards, I don't believe it's been updated, but I found the HTML Internationalization RFC http://www.ietf.org/rfc/rfc2070.txt. On page 16, it mentions that setting the charset with a content-type of x-www-form-urlencoded should carry the understanding that the "URL encoding of [RFC1738] is applied on top of the specified character encoding, as a kind of implicit Content-Transfer-Encoding". In this case, it does seem valid to set the charset on the POST.

          Hiroaki Kawai added a comment -

          This patch will fix the issue.

          New in Servlet Spec 2.5, we can specify the expected incoming encoding rather than having it decoded as an ISO-8859-1 string.
          http://java.sun.com/javaee/5/docs/api/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)

          The patch will only work with servlet engines implementing Servlet 2.5 (i.e., Tomcat 6 or the like), but I think this is the most desirable way.
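          The effect of the assumed request encoding can be seen with the JDK's URLDecoder: the same percent-encoded bytes yield different strings depending on the charset the container decodes with (a minimal illustration of what setCharacterEncoding changes, not the servlet code itself):

```java
import java.net.URLDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        String body = "h%C3%A9llo"; // percent-encoded UTF-8 bytes of "héllo"

        // A container defaulting to ISO-8859-1 turns each byte into its own char:
        System.out.println(URLDecoder.decode(body, "ISO-8859-1")); // hÃ©llo

        // With the encoding set to UTF-8, the two bytes 0xC3 0xA9 decode
        // back to the single character 'é':
        System.out.println(URLDecoder.decode(body, "UTF-8")); // héllo
    }
}
```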

          Lars Kotthoff added a comment -

          After reading http://www.w3.org/TR/html401/interact/forms.html#form-content-type it seems to me that the only reliable way to ensure that the data is encoded/decoded properly is to send the request parameters as parts of a multi-part request. The charset of each part can be set to UTF-8, the content-type header is generated by httpclient, and nothing needs to be url-encoded.

          The downside is that the size of requests becomes larger, as there's quite a lot of overhead when putting each parameter into a separate part.

          Attached the patch "SOLR-443-multipart.patch" which makes the necessary changes to CommonsHttpSolrServer. Verified to work with the Jetty version used in the tests and Tomcat 5.5.

          A possible optimisation would be to check each parameter for non-ASCII characters and only make it a separate part if it contains any; otherwise just include it as a regular parameter.
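          The size overhead Lars mentions comes from the per-part boundary line and headers. A rough sketch of what a single parameter costs as a multipart part versus a urlencoded pair (the boundary string and exact header layout here are illustrative, not what HttpClient actually emits):

```java
public class MultipartOverheadDemo {
    public static void main(String[] args) {
        String name = "q";
        String value = "features:héllo";

        // url-encoded form: name=value plus a '&' separator
        String urlencodedPair = name + "=" + value + "&";

        // One multipart/form-data part: boundary line, headers, blank line, value
        String boundary = "----hypotheticalBoundary1234"; // illustrative boundary
        String multipartPart =
            "--" + boundary + "\r\n" +
            "Content-Disposition: form-data; name=\"" + name + "\"\r\n" +
            "Content-Type: text/plain; charset=UTF-8\r\n" +
            "\r\n" +
            value + "\r\n";

        System.out.println("urlencoded: " + urlencodedPair.length() + " chars");
        System.out.println("multipart:  " + multipartPart.length() + " chars");
        // The fixed framing dwarfs a short value, which is why requests with
        // thousands of small parameters grow by an order of magnitude.
    }
}
```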

          Yonik Seeley added a comment -

          I just tested the latest 5.5 Tomcat (5.5.26).
          It appears that the coding of x-www-form-urlencoded is now assumed to be the same as the URI encoding (percent-encoded UTF-8 rather than Latin-1, if configured that way). I'm not sure if it was like that in the past, but it works now at least!

          Just set URIEncoding="UTF-8" for the connector...
          see http://wiki.apache.org/solr/SolrTomcat under "URI Charset Config"
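          The connector setting referred to above goes on the HTTP connector in Tomcat's server.xml (shown here as a sketch of a typical connector; the port and other attributes will vary per installation):

```xml
<!-- server.xml: tell Tomcat to decode %-escapes in the request URI as UTF-8 -->
<Connector port="8080" URIEncoding="UTF-8" />
```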

          Lars Kotthoff added a comment -

          I'm also using Tomcat 5.5.26 here, but I can't reproduce that behaviour. I've tested on two different machines, but my Tomcat always assumes that the POST body is url-encoded ISO-8859-1; that is, when I use the current SVN version, it only works for ASCII characters (whose encoding is the same in ISO-8859-1 and UTF-8). If I remove the line that sets the encoding of the POST body to UTF-8, it works for all ISO-8859-1 characters, as httpclient encodes to ISO-8859-1 by default.

          I'm very much in favour of a solution which works because all encodings are specified in the proper places as opposed to something that just happens to work with a "standard" configuration, but is not covered by any internet standard. This would be a timebomb just waiting to go off when somebody switches servlet container versions/configurations.

          Worse still, this problem is likely to affect people who are just using and not writing their own code for Solr and don't know anything about the internals (cf. SOLR-303). And they aren't going to get an error message telling them that the character encoding is wrong, but a NullPointerException from the bowels of the faceting code.

          The overhead from using multi-part requests may be considerable, but I don't think that network I/O and processing of network messages is likely to become a bottleneck in typical Solr applications.

          Yonik Seeley added a comment -

          Did you try setting URIEncoding="UTF-8" on the connector?
          Without that, you can't even correctly do a query that contains international chars.

          I indexed the example data, and with standard tomcat config, verified that SolrJ found nothing when searching for hello (with an accent over the e... it's in solr.xml) with both GET and POST.
          After editing the tomcat config and switching it to UTF-8, both GET and POST correctly find the solr example document.

          a NullPointerException from the bowels of the faceting code.

          That seems like a related but separate issue, and it would be nice if it were handled more gracefully.

          Lars Kotthoff added a comment -

          Did you try setting URIEncoding="UTF-8" on the connector?
          Without that, you can't even correctly do a query that contains international chars.

          Yes. A lot of the queries I issue are in Japanese.

          I should add that I'm using the Debian flavour of Tomcat; the exact version number is 5.5.26-3. I don't know whether this version is patched in a way that affects this, but the Tomcat documentation (http://tomcat.apache.org/tomcat-5.5-doc/config/http.html) specifically mentions decoding the URL for that setting. That may or may not be intentional, but I'm pretty sure that the behaviour you're seeing is "accidental".

          As for the NPE, it occurs when a request for facet counts returns something for a facet value which wasn't in the request. I think that it should only be handled more gracefully to the extent of giving a more meaningful error message. But there's no need to if the underlying issue is fixed.

          Yonik Seeley added a comment -

          You're right, Lars: setting the URIEncoding didn't work for Tomcat.

          I checked in a test program: solr/example/exampledocs/test_utf8.sh
          It seems that using

          Content-Type: application/x-www-form-urlencoded; charset=UTF-8
          

          works for Jetty, Tomcat (I tested 5.5), and Resin (I tested 3.1)

          On a related note, I checked in a fix for distributed faceting refinement to ignore facet.query values that it doesn't know about. It's unfortunate that it will hide this problem (that's why I made the UTF-8 test script), but it seems like the correct thing to do since another component may add additional request parts.

          Lars Kotthoff added a comment -

          I can confirm that setting the content type manually to "application/x-www-form-urlencoded; charset=UTF-8" works, but that seems like a dirty hack to me. There's no standard/specification covering it.

          In any case, I'd be OK with either setting the content type manually to something with a UTF-8 charset or putting all parameters in a multi-part POST, albeit the first option only works because everybody happened to implement it this way.

          To be honest I'm not too happy about ignoring unknown facet values because this will produce incorrect facet counts when something goes wrong. In which case would other components add additional facet.query parameters?

          Yonik Seeley added a comment -

          I can confirm that setting the content type manually to "application/x-www-form-urlencoded; charset=UTF-8" works, but that seems like a dirty hack to me. There's no standard/specification/.. covering that.

          I agree it's a bit hackish... but that's the state of things. I'm more concerned with whether it actually works everywhere (and I was surprised that it seems to). I imagine in the future UTF-8 will be the standard... there's no getting around it unless one wants to just ban x-www-form-urlencoded POSTs for non-ASCII, and that doesn't seem reasonable.

          I started using POST because the queries could go over the size limits of GET (so that's yet another hack). Using multi-part would really blow up the size of these requests, and could actually become a bottleneck when the number of servers is high.

          Lars Kotthoff added a comment -

          I agree that using multi-part increases the size of the requests significantly, but I don't think that it's going to be much of a problem.

          For example, consider SOLR-303. The requests for facet refinements use a large number of facet queries, so those would become significantly bigger. This is only really going to impact performance on the network interface of the machine sending the requests. The responses still come back in the old format, and creating a multi-part POST request isn't more expensive than creating a normal one. So the request would take longer to transmit, and the shards probably need more processing time to assemble the parts. I'd be surprised if the increase in processing time had any measurable impact on performance. As for network connectivity, even with multi-part requests for many facets we're talking about sizes on the order of a few hundred kB. Unless the increase in size actually saturates the network connection (which won't happen until there are several hundred shards), the penalty will be a few milliseconds more delay.

          It certainly seems inefficient and wasteful to use multi-part requests, but I don't think the actual performance penalty is going to be significant. AFAIK the requests Solr sends like this are small anyway. I'll try to do some experiments to be able to give some hard numbers.

          Lars Kotthoff added a comment -

          I've just done some tests with curl and a servlet that does nothing but parse the request parameters on Tomcat 5.5. POSTing a 48KB file as a single part takes about 13ms and generates about 50KB of traffic. Almost all of that time is spent processing at the client, i.e. executing curl and assembling the request. POSTing the same file as a multi-part request with 1 part per line (6318 parts total) takes about 80ms and generates about 650KB of traffic. About half of that time is spent at the client assembling the request.

          The time was measured at the client and is the total time required for everything – curl assembles the request, sends it to the server, the servlet parses the parameters, generates a dummy page, and sends it back. Client and server are connected with Gigabit ethernet.

          In conclusion, yes, the overhead is significant, but even with large requests it's nowhere near being a bottleneck. Processing more than 6000 queries is going to take significantly longer than 80ms.

          But YMMV of course.
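          A back-of-envelope check of those numbers (my own arithmetic, using Lars' figures, not part of his tests): 650 KB of multipart traffic minus the 48 KB payload, spread over 6318 parts, implies roughly 100 bytes of framing per part, which is about what a boundary line plus part headers occupy.

```java
public class OverheadEstimate {
    public static void main(String[] args) {
        int multipartTraffic = 650 * 1024; // ~650 KB observed multipart request
        int payload = 48 * 1024;           // the 48 KB file itself
        int parts = 6318;                  // one part per line

        int overheadPerPart = (multipartTraffic - payload) / parts;
        System.out.println(overheadPerPart + " bytes of framing per part");
    }
}
```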

          Gunnar Wagenknecht added a comment -

          So what about making this configurable? It looks like the server side allows both ways. It looks to me that Content-Type: application/x-www-form-urlencoded; charset=UTF-8 basically works, but there is no 100% guarantee. On the other hand, multi-part POSTs come with a guarantee but also a performance penalty. I think it would be fair to document both options and let the API client decide which one better fits their use case.

          Lars Kotthoff added a comment -

          Attaching a new patch which makes it configurable through a constructor parameter whether to use single-part POSTs (setting the content type to "application/x-www-form-urlencoded; charset=UTF-8") or multi-part POSTs. Single-part is the default.

          Note that this patch changes the current behaviour for requests with streams: when content streams are present, multi-part requests are always used. The request has to have multiple parts, and the Content-Type header of a multi-part POST must include the boundary between the parts, which is unknown when the request is assembled, so the header cannot be set explicitly.

          Yonik Seeley added a comment -

          Committed. Thanks everyone!


            People

            • Assignee:
              Unassigned
            • Reporter:
              Andrew Schurman
            • Votes:
              1
            • Watchers:
              3
