Summary: | utf8 to ucs2 conversion failed on Windows | | |
---|---|---|---|
Product: | Apache httpd-2 | Reporter: | ernesto <ernestoname> |
Component: | mod_cgi | Assignee: | Apache HTTPD Bugs Mailing List <bugs> |
Status: | RESOLVED DUPLICATE | | |
Severity: | normal | CC: | gazerro, rd9 |
Priority: | P2 | | |
Version: | 2.1-HEAD | | |
Target Milestone: | --- | | |
Hardware: | PC | | |
OS: | Windows XP | | |
Description
ernesto
2005-05-20 15:12:07 UTC
Dude - if you are running mod_cgid on Win32 then all bets are off :) Reclassifying.

And I'm totally clueless, but I guess my first question is why use php.exe as a CGI when you can plug it in as a module, and actually serve pages without warming up your cpu? CGI is a very disk/cpu/kernel intensive way to serve any content whatsoever.

This looks like a variant of Bug 32730 which had the same issues on Windows with some different environment variables. The problem is that Apache tries to translate every environment variable from Unicode's UTF-8 encoding into UCS-2, even though the environment variable may be in another character encoding (e.g. ISO-8859-1 aka Latin-1). An extension of the fix for Bug 32730 should work, although the real solution

This is not specific to mod_cgi and PHP, as it happens with non-PHP CGI programs. CGI is still a reasonable option in some cases, e.g. for development of CGI scripts on Windows for installation on Linux+CGI (or a production mod_perl server on any OS).

Got interrupted when writing last comment, sorry... To finish the incomplete sentence in that comment: the real solution in my view is to go through all environment variables that could be non-UTF8 (virtually anything that is a string) and avoid converting those - or, better, only convert those guaranteed not to be strings, or guaranteed to be ASCII only. Another environment variable with this problem is REDIRECT_URL, logged in comment to Bug 32730 after fix was committed. This is a fairly simple extension of the patch I submitted for that bug. A configuration directive to turn off this conversion might also be useful.

Some more variants of this bug... Bug 13029 is another variant for the environment variable SSL_SERVER_S_DN_L. I think the fundamental issue is that there's no way to turn off this UTF-8 to UCS-2 conversion, and it only happens on Windows, well before any CGI script or other code has a chance to do its own non-UTF-8 based conversion.
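To make the failure mode above concrete, here is a minimal Python sketch. It is not Apache's code: `utf8_to_ucs2` is a hypothetical stand-in that uses UTF-16LE in place of UCS-2, and the `REDIRECT_URL` value is made up for illustration. It only shows that a strict UTF-8 conversion accepts a value that really is UTF-8 and rejects the same text encoded as ISO-8859-1.

```python
def utf8_to_ucs2(raw: bytes) -> bytes:
    """Strictly decode UTF-8, then re-encode as UTF-16LE (stand-in for UCS-2)."""
    return raw.decode("utf-8").encode("utf-16-le")


# Hypothetical environment entry, not taken from the bug report.
entry = "REDIRECT_URL=/caf\u00e9"
value_utf8 = entry.encode("utf-8")         # e-acute stored as bytes C3 A9
value_latin1 = entry.encode("iso-8859-1")  # e-acute stored as the single byte E9

# A genuine UTF-8 value converts without trouble.
converted = utf8_to_ucs2(value_utf8)

# The Latin-1 value is not valid UTF-8, so the strict conversion fails.
try:
    utf8_to_ucs2(value_latin1)
    latin1_ok = True
except UnicodeDecodeError:
    latin1_ok = False
```

Since the converter cannot tell in advance which encoding a client used, any variable that carries client-supplied text can hit the failing branch.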
The REDIRECT_QUERY_STRING variant was also reported at http://mail-archives.apache.org/mod_mbox/httpd-users/200504.mbox/%3c006901c536e0$3dd72010$5d01250a@vdm%3e

Yes - it looks like this needs to be more tolerant, overall, of non-utf8 data, and I'll look at rolling in a solution that doesn't impact security. Thanks for your observations, they appear spot-on.

Not sure what you mean by security implications, but I don't think that falling back to another encoding such as ISO-8859-1 is necessary. Taking TWiki as an example, which uses paths like /bin/view/Main/WebHome, where view is the CGI script, and /Main/WebHome is the PATH_INFO (see http://twiki.org/cgi-bin/viewfile/Support/ApacheErrorsDuringEdit?rev=1.1;filename=testenv.htm for example of CGI environment variables), it would be useful to specify the following to handle non-UTF-8 encodings such as ISO-8859-1 (which are used by POST from Firefox currently):

AUTH_TYPE             Raw
DOCUMENT_ROOT         Convert
GATEWAY_INTERFACE     Raw
HTTP_ACCEPT           Raw
HTTP_ACCEPT_CHARSET   Raw
HTTP_ACCEPT_ENCODING  Raw
HTTP_ACCEPT_LANGUAGE  Raw
HTTP_CONNECTION       Raw
HTTP_HOST             Raw
HTTP_KEEP_ALIVE       Raw
HTTP_USER_AGENT       Raw
PATH                  Convert (since it has pathnames)
QUERY_STRING          Raw (not a filename, should be interpreted by application)
REMOTE_ADDR           Raw
REMOTE_PORT           Raw
REMOTE_USER           Raw
REQUEST_METHOD        Raw
REQUEST_URI           Convert if valid UTF-8 (and not overlong encoding)
SCRIPT_FILENAME       Convert if valid UTF-8 (and not overlong encoding)
SCRIPT_NAME           Convert if valid UTF-8 (and not overlong encoding)
SERVER_ADDR           Raw
SERVER_ADMIN          Raw
.... (rest are all raw)

Basically, only those variables that correspond to filenames should be converted, and then only if they are valid UTF-8 without overlong encoding. Any variables not used by Apache should not be converted, but left to the application, or a suitable add-on Apache module for conversion.
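The per-variable policy listed above could be sketched roughly as follows. This is an illustration in Python, not a patch for Apache: the names `POLICY`, `is_valid_utf8` and `prepare` are hypothetical, UTF-16LE stands in for UCS-2, and the sketch relies on Python's strict UTF-8 decoder, which rejects overlong encodings.

```python
RAW = "raw"
CONVERT = "convert"
CONVERT_IF_UTF8 = "convert-if-utf8"

# Only the filename-like variables from the list are named here;
# everything else defaults to RAW ("rest are all raw").
POLICY = {
    "DOCUMENT_ROOT": CONVERT,
    "PATH": CONVERT,
    "REQUEST_URI": CONVERT_IF_UTF8,
    "SCRIPT_FILENAME": CONVERT_IF_UTF8,
    "SCRIPT_NAME": CONVERT_IF_UTF8,
}


def is_valid_utf8(raw: bytes) -> bool:
    """True if raw is well-formed UTF-8 (overlong forms are rejected)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def prepare(name: str, raw: bytes) -> tuple:
    """Return ("wide", utf16le_bytes) or ("raw", untouched_bytes) for one variable."""
    policy = POLICY.get(name, RAW)
    if policy == RAW:
        return ("raw", raw)
    if policy == CONVERT_IF_UTF8 and not is_valid_utf8(raw):
        # Not UTF-8: pass the bytes through instead of failing.
        return ("raw", raw)
    # CONVERT entries are assumed to be server-controlled paths, so a
    # decoding error here is allowed to propagate to the caller.
    return ("wide", raw.decode("utf-8").encode("utf-16-le"))
```

With this shape, a Latin-1 QUERY_STRING is never touched, while SCRIPT_NAME is widened only when it decodes cleanly.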
TWiki has done its own interpretation of UTF-8 URLs, independent of the OS it is running on, which is based on a technique used by IBM's web server for mainframe (z/OS) - basically it tries to recognise the URL as UTF-8 and then falls back to the native encoding (i.e. no conversion done at all). In fact we do this on the PATH_INFO ourselves.

If Apache is going to carry on doing its own UTF-8 to UCS-2 conversion, which I suppose it must do in some cases that map onto a Windows filesystem (and others such as MacOS X HFS+ etc), it would be good if it recognises when data is really UTF-8 in this way. Also, it would be very helpful to have a configuration option that lets you say "don't convert variable X if it matches regex Y", e.g. don't convert PATH_INFO if it matches "/twiki/bin/.*"

Some TWiki pages that might be of interest here are:

http://twiki.org/cgi-bin/view/Codev/EncodeURLsWithUTF8 - how TWiki does auto-detection and conversion of UTF-8 encoding for PATH_INFO in URLs
http://twiki.org/cgi-bin/view/Codev/InternationalisationUTF8 - includes material on character set auto-detection including excerpt on IBM web server approach - fortunately UTF-8 detection is much easier than the general case.
http://twiki.org/cgi-bin/view/Codev/MacOSXFilesystemEncodingWithI18N - talks about a filesystem-related issue with Unicode normalisation forms on Mac OS X
http://twiki.org/cgi-bin/view/Codev/ProposedUTF8SupportForI18N - general page summarising research on UTF-8 for TWiki, including some useful links

Hi all,

We are implementing an application that uses SSL client certificates, and it seems like we are running into the same problem that is described here:

[Mon Mar 05 09:48:34 2007] [error] [client 195.7.31.10] File does not exist: C:/bec_was/servletpif/apache2/docroots/errordocs (22)Invalid argument: utf8 to ucs2 conversion failed on this string: SSL_CLIENT_S_DN_CN=Anette Birgitte Franzp\xf8tter

Is there a way that I can work around this problem?
Best regards
Preben Nilsson
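For reference, the two ideas raised earlier in the thread (recognise real UTF-8 and otherwise fall back to the native encoding, and "don't convert variable X if it matches regex Y") can be sketched in Python as below. This only illustrates the technique; it is not an Apache directive or a tested workaround. The names `NO_CONVERT`, `decode_with_fallback` and `should_convert` are hypothetical, ISO-8859-1 is assumed as the native encoding, and the `\xf8` in the logged string is assumed to stand for the single byte 0xF8.

```python
import re

# Hypothetical exemption table: variable name -> pattern for values to leave alone.
NO_CONVERT = {"PATH_INFO": re.compile(r"/twiki/bin/.*")}


def decode_with_fallback(raw: bytes, native: str = "iso-8859-1") -> str:
    """Treat raw as UTF-8 when it is well-formed, else as the native encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(native)


def should_convert(name: str, raw: bytes) -> bool:
    """False when an exemption pattern for this variable matches the whole value."""
    pattern = NO_CONVERT.get(name)
    if pattern is None:
        return True
    # Latin-1 maps every byte to one character, so any value can be matched.
    return pattern.fullmatch(raw.decode("iso-8859-1")) is None


# The entry from the error log above, with \xf8 taken as one byte.
entry = b"SSL_CLIENT_S_DN_CN=Anette Birgitte Franzp\xf8tter"
text = decode_with_fallback(entry)
```

The byte 0xF8 can never start a valid UTF-8 sequence, so the first attempt fails and the value is read as ISO-8859-1, where 0xF8 is "ø".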