Bug 22666 - Non-US-ASCII symbols entered into a form appear wrong in JSP
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 4
Classification: Unclassified
Component: Servlet & JSP API
Version: 4.0.6 Final
Hardware: PC Windows XP
Importance: P3 major
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
Reported: 2003-08-22 18:09 UTC by Stanislav Davydov
Modified: 2004-11-16 19:05 UTC
Description Stanislav Davydov 2003-08-22 18:09:10 UTC
My HTML page uses the UTF-8 charset.
I have a simple form with one text input field.
When I post non-US-ASCII symbols in this text field, such as ö (Latin
Small Letter O With Diaeresis), they arrive in the JSP parameters in a strange state.

I have the following code in the JSP:

request.setCharacterEncoding("UTF-8");
String [] texts = request.getParameterValues ("text");

As my form has only one text field, "text", I should get a String[1] array here.
But I get a String[3] with strange contents.

I added the following code to inspect the data:

    String [] texts = request.getParameterValues("text");
    for (int i = 0; i < texts.length; i++) {
      String value = texts[i];
      byte [] bytes = value.getBytes("UTF-8");
      for (int j = 0; j < bytes.length; j++) {
        byte aByte = bytes[j];
        System.out.print(aByte + ", ");
      }
      System.out.println("");
    }

And I got the following:
-61, -125, -62, -125, -61, -126, -62, -125, -61, -125, -62, -126, -61, -126, -62, -74, 

-61, -125, -62, -125, -61, -126, -62, -74, 

-61, -125, -62, -74, 

So each of the three Strings contains a wrong sequence of characters, different from
&#x00F6; (ö).

What's wrong with it?
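
One way to read the byte dumps above: they are consistent with the UTF-8 bytes of ö (0xC3 0xB6) being decoded as ISO-8859-1 and re-encoded as UTF-8 one, two, and three times. The sketch below uses only the JDK to reproduce that pattern; the class and method names are hypothetical, and it does not explain why three array entries arrive, only what the byte sequences look like.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the reporter's application.
class DoubleEncodingDemo {

    // One round of mis-decoding: treat UTF-8 bytes as ISO-8859-1 text,
    // then encode that (wrong) text as UTF-8 again.
    static byte[] misdecodeOnce(byte[] utf8Bytes) {
        String wrong = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        return wrong.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes the same way as the loop in the report.
    static String dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(b).append(", ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "\u00F6".getBytes(StandardCharsets.UTF_8); // -61, -74
        for (int round = 1; round <= 3; round++) {
            bytes = misdecodeOnce(bytes);
            System.out.println(round + " round(s): " + dump(bytes));
        }
    }
}
```

After one round the dump is the 4-byte sequence from the third string in the report, after two rounds the 8-byte sequence, and after three rounds the 16-byte sequence.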
Comment 1 Stanislav Davydov 2003-08-22 18:14:53 UTC
After posting, I saw the following in the notification e-mail in my mailbox:

... skipped...
Small Letter O With Diaresis (Г¶), it comes to JSP paramaters in strange
state.
....skipped...

So (&#x00F6;) has been changed to (&#0393;&#00B6)
Comment 2 Bazza 2003-11-05 14:12:14 UTC
To me this looks like you are experiencing the same problem as 
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=2760

i.e. that you can't use setCharacterEncoding() inside a JSP because Jasper has
already checked the value of jsp_precompile, and the spec says you can't set the
encoding after checking the value of a parameter.

I'm not convinced by the solution proposed there, however. Tomcat appears to use
the same character encoding for both the requested URL and POSTed parameters,
which is incorrect. 

Specifically this code in HttpRequestBase:
// Parse any parameters specified in the query string
        String queryString = getQueryString();
        try {
            RequestUtil.parseParameters(results, queryString, encoding);
        } catch (UnsupportedEncodingException e) {
            ;
        }
should be:
// Parse any parameters specified in the query string
        String queryString = getQueryString();
        try {
            RequestUtil.parseParameters(results, queryString, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            ;
        }
because the servlet spec does /not/ say to use the user's encoding for request
URLs; it says it is to "parse POST data" (servlet 2.3 sec 4.9); and the HTTP
spec says the encoding for URLs is UTF-8.

This problem is exacerbated for users of the portlet spec. With portlets,
typical usage will have parameters in form actions (e.g. a target portlet window
id) as well as in the body of POSTed forms. The form action URL is encoded on
the server (should use UTF-8), while the POSTed parameters sent with it use the
browser's default encoding, which may not be the same. So attempting to set the
encoding with a filter won't work, because the incoming request has two
encodings at work. If Tomcat decoded URLs differently from request bodies there
would be no problem.
Comment 3 Larry Isaacs 2003-11-05 14:43:09 UTC
FYI: I checked, and Craig implemented a fix for Bug #2760 prior to the original
release of Tomcat 4.0.
Comment 4 Bazza 2003-11-05 15:07:08 UTC
Larry, I'd be happy to see this bug closed then, but the bug with the request
encoding being used to decode URLs is there right now in CVS. Do you want me to
raise this separately and close this bug?
Comment 5 Bazza 2003-11-05 15:12:43 UTC
Sorry, I shouldn't have talked about closing this bug, since obviously Stanislav
still sees something wrong. But the request encoding bug hit us today (Arabic
win2k client, English win2k server). We had been using URLEncoder.encodeURL(blah,
"UTF-8"); now we filter just GET requests to set the encoding to UTF-8. However, the
bug would prevent us from using this client and server with portlets, as described.
Comment 6 Stanislav Davydov 2003-11-19 13:16:18 UTC
I've solved this issue by using the ISO-8859-1 encoding for my pages. When
Unicode data comes from the form in the &#number; format, I decode it to a real
Unicode string with a request wrapper. It's not a very pretty solution, but it
works :)
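
The decoding step in this workaround could look roughly like the sketch below. It is an assumption about the approach, not the reporter's actual code: the class name is hypothetical, only decimal references of the form &#number; are handled, and the request wrapper that would call it is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a request wrapper could apply it to each parameter value.
class NumericEntityDecoder {

    // Matches decimal numeric character references such as &#246;
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    static String decode(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave references that are not valid Unicode code points untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // &#246; is the decimal reference for U+00F6 (ö)
        System.out.println(decode("sch&#246;n"));
    }
}
```

Note that this approach cannot tell a reference generated by the browser from the same text typed literally by the user, which is one reason it is only a workaround.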
Comment 7 Mark Thomas 2004-02-25 23:08:54 UTC
There have been a number of updates to TC4 and TC5 since this report was filed 
to address this type of issue. The updates are described below. As far as I 
can tell, the issues described in this report should now be resolved.

There are a number of situations where there may be a requirement to use
non-US-ASCII characters in a URI. These include:
- Parameters in the query string
- Servlet paths

There is a standard for encoding URIs
(http://www.w3.org/International/O-URL-code.html) but this standard is not
consistently followed by clients. This causes a number of problems.

The functionality provided by Tomcat (4 and 5) to handle this less than ideal 
situation is described below.

1. The Coyote HTTP/1.1 connector has a useBodyEncodingForURI attribute which 
if set to true will use the request body encoding to decode the URI query 
parameters.
  - The default value is true for TC4 (breaks spec but gives consistent 
behaviour across TC4 versions)
  - The default value is false for TC5 (spec compliant but there may be 
migration issues for some apps)
2. The Coyote HTTP/1.1 connector has a URIEncoding attribute which defaults to 
ISO-8859-1.
3. The parameters class (o.a.t.u.http.Parameters) has a QueryStringEncoding 
field which defaults to the URIEncoding. It must be set before the parameters 
are parsed to have an effect.
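
To illustrate why the choice of URI encoding matters, the sketch below decodes the same percent-encoded query value with two charsets. It uses only java.net.URLDecoder from the JDK, not Tomcat's own decoding code, and the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; shows the effect of the charset, not Tomcat internals.
class QueryDecodingDemo {

    // Wraps URLDecoder.decode so callers need not handle the checked exception.
    static String decode(String value, String charsetName) {
        try {
            return URLDecoder.decode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%C3%B6"; // the UTF-8 bytes of U+00F6, percent-encoded

        // Decoded as UTF-8, the two bytes form the single character U+00F6.
        System.out.println(decode(raw, "UTF-8"));

        // Decoded as ISO-8859-1, each byte becomes its own character (U+00C3, U+00B6).
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```

A client that sends UTF-8 in the query string to a connector that decodes with ISO-8859-1 would therefore produce two wrong characters in place of one correct one.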

Things to note regarding the servlet API:
1. HttpServletRequest.setCharacterEncoding() normally only applies to the 
request body NOT the URI.
2. HttpServletRequest.getPathInfo() is decoded by the web container.
3. HttpServletRequest.getRequestURI() is not decoded by the container.

Other tips:
1. Use POST with forms to return parameters as the parameters are then part of 
the request body.