My HTML page uses charset UTF-8. I have a simple form with one text input field. When I post non-US-ASCII characters into this field, such as ö - Latin Small Letter O With Diaeresis (ö), they arrive in the JSP parameters in a strange state.

I have the following code in the JSP:

    request.setCharacterEncoding("UTF-8");
    String[] texts = request.getParameterValues("text");

As my form has only one text field, "text", I should get a String[1] array here. Instead I get a String[3] with strange contents. I added the following code to inspect the data:

    String[] texts = request.getParameterValues("text");
    for (int i = 0; i < texts.length; i++) {
        String value = texts[i];
        byte[] bytes = value.getBytes("UTF-8");
        for (int j = 0; j < bytes.length; j++) {
            byte aByte = bytes[j];
            System.out.print(aByte + ", ");
        }
        System.out.println("");
    }

And I got the following:

    -61, -125, -62, -125, -61, -126, -62, -125, -61, -125, -62, -126, -61, -126, -62, -74, -61, -125, -62, -125, -61, -126, -62, -74, -61, -125, -62, -74,

So each of the three strings contains a wrong set of characters, different from ö. What's wrong with it?
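One possible reading of the byte dump, offered as an inference and not something confirmed in this report: if the 28 bytes are split into runs of 16, 8 and 4, each run matches what you get when the UTF-8 bytes of ö are wrongly decoded as ISO-8859-1 three times, twice and once. The class and method names below are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RepeatedMisdecoding {

    /** One round of wrong decoding: take the UTF-8 bytes of s and read them as ISO-8859-1. */
    static String misdecodeOnce(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Applies the wrong decoding the given number of times, then dumps the result as UTF-8 bytes. */
    static byte[] utf8BytesAfterRounds(String s, int rounds) {
        String current = s;
        for (int i = 0; i < rounds; i++) {
            current = misdecodeOnce(current);
        }
        return current.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00f6"; // ö
        // Print three, two and one rounds, in the same style as the dump in the report.
        for (int rounds = 3; rounds >= 1; rounds--) {
            System.out.println(Arrays.toString(utf8BytesAfterRounds(original, rounds)));
        }
    }
}
```

If this reading is right, the parameter value is being decoded with the wrong charset somewhere before the JSP sees it. It does not by itself explain why three values are returned for a single field.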
After posting, I saw the following in the notification email in my mailbox:

    ... skipped ... Small Letter O With Diaresis (Г¶), it comes to JSP paramaters in strange state. ... skipped ...

So (ö) has been changed to (Ɖ�B6)
To me this looks like you are experiencing the same problem as http://nagoya.apache.org/bugzilla/show_bug.cgi?id=2760, i.e. that you can't use setCharacterEncoding() inside a JSP because Jasper has already checked the value of jsp_precompile, and the spec says you can't set the encoding after checking the value of a parameter.

I'm not convinced by the solution proposed there, however. Tomcat appears to use the same character encoding for both the requested URL and POSTed parameters, which is incorrect. Specifically, this code in HttpRequestBase:

    // Parse any parameters specified in the query string
    String queryString = getQueryString();
    try {
        RequestUtil.parseParameters(results, queryString, encoding);
    } catch (UnsupportedEncodingException e) {
        ;
    }

should be:

    // Parse any parameters specified in the query string
    String queryString = getQueryString();
    try {
        RequestUtil.parseParameters(results, queryString, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        ;
    }

because the servlet spec does /not/ say to use the user's encoding for request URLs; it says it is to "parse POST data" (servlet 2.3 sec 4.9); and the HTTP spec says the encoding for URLs is UTF-8.

This problem is exacerbated for users of the portlet spec. With portlets, typical usage will have parameters in form actions (e.g. a target portlet window id) as well as in the body of POSTed forms. The form action URL is encoded on the server (and should use UTF-8), while the POSTed parameters sent with it use the browser's default encoding, which may not be the same. So attempting to set the encoding with a filter won't work, because the incoming request has two encodings at work. If Tomcat decoded URLs differently from request bodies there would be no problem.
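To illustrate the "two encodings at work" point, here is a minimal sketch using only java.net.URLDecoder. It assumes a query string value percent-encoded by the server as UTF-8 and a body value percent-encoded by the browser as ISO-8859-1; the class name, the sample values and the choice of ISO-8859-1 for the browser are assumptions for the example, not taken from Tomcat.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class TwoEncodingsDemo {

    /** Percent-decodes a value using the named charset. */
    static String decode(String percentEncoded, String charsetName) {
        try {
            return URLDecoder.decode(percentEncoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String queryValue = "%C3%B6"; // ö as written into the form action URL (UTF-8), assumed
        String bodyValue = "%F6";     // ö as POSTed by a browser using ISO-8859-1, assumed

        // Decoding each part with its own encoding recovers ö in both cases.
        System.out.println(decode(queryValue, "UTF-8"));
        System.out.println(decode(bodyValue, "ISO-8859-1"));

        // Forcing a single encoding onto both parts garbles one of them.
        System.out.println(decode(queryValue, "ISO-8859-1")); // two characters instead of one
        System.out.println(decode(bodyValue, "UTF-8"));       // not a valid UTF-8 sequence
    }
}
```

Whichever single encoding a filter sets, one of the two values comes out wrong, which is why decoding the URL separately from the body matters here.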
FYI: I checked, and Craig implemented a fix for Bug #2760 prior to the original release of Tomcat 4.0.
Larry, I'd be happy to see this bug closed then, but the bug with the request encoding being used to decode URLs is there right now in CVS. Do you want me to raise this separately and close this bug?
Sorry, I shouldn't have talked about closing this bug, since obviously Stanislav still sees something wrong. However, the request encoding bug hit us today (Arabic Win2k client, English Win2k server). We had been using URLEncoder.encodeURL(blah, "UTF-8"); now we filter just GET requests to set the encoding to UTF-8. The bug would still prevent us from using this client and server with portlets, as described.
I've solved this issue by using ISO-8859-1 encoding for my pages. When Unicode data comes from the form in the format &#number;, I decode it to a real Unicode string with a request wrapper. It's not a very pretty solution, but it works :)
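The decoding step of such a workaround might look roughly like the sketch below. It is not the wrapper actually used here; the class name and the regular expression are assumptions, and it only handles decimal references of the form &#number;. A request wrapper would call something like this on each parameter value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {

    // Decimal numeric character references such as &#1046; (at most 7 digits, so parseInt cannot overflow).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    /** Replaces each decimal reference with the character it names; other text is left untouched. */
    static String decode(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Keep the reference as-is if the number is outside the Unicode range.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // &#1046; is the Cyrillic letter Zhe, which ISO-8859-1 cannot represent.
        System.out.println(decode("a&#1046;b"));
    }
}
```

One caveat with this approach: a user who literally types "&#1046;" into the field cannot be told apart from a browser-generated reference.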
There have been a number of updates to TC4 and TC5 since this report was filed to address this type of issue. The updates are described below. As far as I can tell, the issues described in this report should now be resolved.

There are a number of situations where there may be a requirement to use non-US-ASCII characters in a URI. These include:
- Parameters in the query string
- Servlet paths

There is a standard for encoding URIs (http://www.w3.org/International/O-URL-code.html) but this standard is not consistently followed by clients. This causes a number of problems.

The functionality provided by Tomcat (4 and 5) to handle this less-than-ideal situation is described below.
1. The Coyote HTTP/1.1 connector has a useBodyEncodingForURI attribute which, if set to true, will use the request body encoding to decode the URI query parameters.
   - The default value is true for TC4 (breaks spec but gives consistent behaviour across TC4 versions)
   - The default value is false for TC5 (spec compliant but there may be migration issues for some apps)
2. The Coyote HTTP/1.1 connector has a URIEncoding attribute which defaults to ISO-8859-1.
3. The parameters class (o.a.t.u.http.Parameters) has a QueryStringEncoding field which defaults to the URIEncoding. It must be set before the parameters are parsed to have an effect.

Things to note regarding the servlet API:
1. HttpServletRequest.setCharacterEncoding() normally only applies to the request body, NOT the URI.
2. HttpServletRequest.getPathInfo() is decoded by the web container.
3. HttpServletRequest.getRequestURI() is not decoded by the container.

Other tips:
1. Use POST with forms to return parameters, as the parameters are then part of the request body.
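As a rough model of how the two connector attributes interact, the sketch below picks the charset for query parameters the way the description above suggests. It is not Tomcat's code: the class and method names are hypothetical, and the case where no body encoding has been set is deliberately not modeled.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryEncodingModel {

    /** Chooses the charset for query parameters: the body encoding if the flag is on, else the URI encoding. */
    static String queryCharset(boolean useBodyEncodingForURI, String bodyEncoding, String uriEncoding) {
        return useBodyEncodingForURI ? bodyEncoding : uriEncoding;
    }

    /** Percent-decodes one raw query value with the charset chosen above. */
    static String decodeQueryValue(String raw, boolean useBodyEncodingForURI,
                                   String bodyEncoding, String uriEncoding) {
        String charset = queryCharset(useBodyEncodingForURI, bodyEncoding, uriEncoding);
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%C3%B6"; // ö percent-encoded as UTF-8

        // TC4-style default: flag on, so a body encoding of UTF-8 also applies to the query string.
        System.out.println(decodeQueryValue(raw, true, "UTF-8", "ISO-8859-1"));

        // TC5-style default: flag off, URIEncoding left at ISO-8859-1, so the value is mis-decoded.
        System.out.println(decodeQueryValue(raw, false, "UTF-8", "ISO-8859-1"));

        // TC5-style with URIEncoding set to UTF-8 on the connector.
        System.out.println(decodeQueryValue(raw, false, "UTF-8", "UTF-8"));
    }
}
```

Under these assumptions, an application that relied on setCharacterEncoding("UTF-8") for GET parameters on TC4 would need either useBodyEncodingForURI="true" or URIEncoding="UTF-8" on the connector after moving to TC5.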