Any PATH_INFO string that contains URL-encoded bytes that are not part of a valid UTF-8 sequence causes Apache 2.0.52 on Windows to give an Internal Server Error 500, and put the following message in the error.log:

(22)Invalid argument: utf8 to ucs2 conversion failed on this string: PATH_INFO=/Main/FromageD\xe9rap\xe9

The URL that generated this was as follows ('view' being the CGI script, no mod_perl):

http://localhost:8080/cgi-bin/view/Main/FromageD%E9rap%E9

Bug 9223 is similar to this bug, but not a dupe - it covered the QUERY_STRING which is mostly not used by the web application (TWiki, http://twiki.org). As you'd expect, the following URL works fine:

http://localhost:8080/cgi-bin/view?topic=Main.FromageD%E9rap%E9

Most Mozilla-derived browsers including Firefox 1.0 generate URLs in the native character encoding (e.g. ISO-8859-1) by default. In any case, Apache should not be generating an internal server error, but a less serious error (e.g. file not found), allowing mod_fileiri or the web application to interpret the encoding correctly (which TWiki can do as long as it sees the PATH_INFO). This appears to be Windows specific since TWiki has users of internationalisation on Apache 2 and Linux - no doubt due to the Unicode on Windows support.

I realise that such non-UTF-8 URLs are not standards conformant, but if the web application is willing to handle them specially, I think that Apache should at least pass them on without trying to convert them (a configuration option to turn off this conversion would be very useful).

This bug also prevents use of mod_fileiri, which enables such undesirable URLs to be redirected to conformant UTF-8 encoded URLs. As Martin Duerst has confirmed, this runs in the Apache 'fixup' phase.

For more information and workarounds from a TWiki perspective, see http://twiki.org/cgi-bin/view/Codev/ApacheTwoBreaksNonUTF8EncodedURLsOnWindows
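To make the failing input concrete, here is a minimal Python sketch (not Apache or TWiki code) of what the bytes in the report look like once percent-decoded, and of the kind of fallback a web application could apply if it received PATH_INFO untouched. The function name decode_path_info and the choice of ISO-8859-1 as the fallback charset are assumptions made for illustration only.

```python
from urllib.parse import unquote_to_bytes


def decode_path_info(raw_path: str, fallback: str = "iso-8859-1") -> str:
    """Percent-decode a path, trying strict UTF-8 first, then a fallback charset."""
    data = unquote_to_bytes(raw_path)
    try:
        # Strict decoding: raises UnicodeDecodeError on any invalid sequence.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical application-level fallback; the charset is an assumption.
        return data.decode(fallback)


# %E9 is a lone 0xE9 byte: a valid ISO-8859-1 character but not valid UTF-8.
legacy = decode_path_info("/Main/FromageD%E9rap%E9")
# %C3%A9 is the UTF-8 encoding of the same character.
conformant = decode_path_info("/Main/FromageD%C3%A9rap%C3%A9")
assert legacy == conformant == "/Main/FromageD\u00e9rap\u00e9"
```

Both forms of the URL end up naming the same topic, which is why passing the raw bytes through to the application would be enough for it to cope.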
Some extra information:
- this is completely reproducible, any URL using ISO-8859-1 encoded characters in PATH_INFO will do
- the server build is from XAMPP for Windows 1.4.9 - Apache 2.0.52 - http://www.apachefriends.org/en/xampp-windows.html
- server error page is as follows (mod_perl not active on CGI directory):

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, admin@localhost and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log.

Apache/2.0.52 (Win32) mod_perl/1.99_16 Perl/v5.8.4 PHP/5.0.2 Server at localhost Port 8080
Apache on Windows definitely needs to give back at least a 404 rather than a 500 for something like FromageD%E9rap%E9. This is a perfectly acceptable URI (although not one that the server should resolve, it should continue to only resolve with UTF-8), and so having the server blow up with a 500 is totally inappropriate. Just check for UTF-8 before doing the conversion or catch the error, and things should be fine. (if anybody tells me where in the code the conversion function is called, I'll try to help come up with a patch)
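The "check for UTF-8 before doing the conversion or catch the error" idea can be sketched roughly as follows. This is plain Python, not the Apache/APR implementation; utf8_to_wide, status_for and known_paths are hypothetical names, and UTF-16LE stands in for the UCS-2 form mentioned in the error message.

```python
from typing import Optional, Set


def utf8_to_wide(value: bytes) -> Optional[bytes]:
    """Return the UTF-16LE form of value, or None if value is not valid UTF-8."""
    try:
        return value.decode("utf-8").encode("utf-16-le")
    except UnicodeDecodeError:
        # Catch the conversion error instead of letting it abort the request.
        return None


def status_for(path: bytes, known_paths: Set[bytes]) -> int:
    """Pick an HTTP status for a percent-decoded path (illustrative only)."""
    if utf8_to_wide(path) is None:
        # The path cannot name any resource, so report "not found"
        # rather than an internal server error.
        return 404
    return 200 if path in known_paths else 404


known = {"/Main/FromageD\u00e9rap\u00e9".encode("utf-8")}
assert status_for(b"/Main/FromageD\xe9rap\xe9", known) == 404
assert status_for(b"/Main/FromageD\xc3\xa9rap\xc3\xa9", known) == 200
```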
The Firefox 1.0 bug at http://bugzilla.mozilla.org/show_bug.cgi?id=261934 means that users can't persistently set Firefox to do UTF-8 encoding of URLs - there is a fix in place for future releases but no workaround for 1.0. So the 'configure Firefox' workaround is not very convenient for users.
I've had a look at the 2.0.52 code and the immediate issue appears to be in srclib/apr/threadproc/win32/proc.c at lines 473 to 502, or line 480 onwards in CVS (see http://lxr.webperf.org/source.cgi/srclib/apr/threadproc/win32/proc.c#480 ) - this is getting ready to create a Unicode environment block via the CreateProcess API in Win32. It is based on a compile time option, APR_HAS_UNICODE_FS. An immediate patch may be quite easy, using the APR_HAS_ANSI_FS code to build an ANSI environment block - or perhaps just recompiling with APR_HAS_ANSI_FS. A better fix might be to allow APR_HAS_ANSI_FS vs APR_HAS_UNICODE_FS to be selected through a run-time configuration directive. This would enable sites/applications that want environment variables to be completely untouched by the UTF8 to UCS2 conversion to run in 'no environment conversion' mode - the web application can then do its own conversion without having to catch 404 errors. Since there's no guarantee the environment passed to Apache will really be UTF-8, it seems that this 'no conversion' mode is important for some applications at least. Many web applications are portable between *nix and Windows, and have their own UTF8 URI handling code that works fine on both, so it's a pain if they have special code to handle the way Apache does things differently on Windows. If the 'no environment conversion' option is too sweeping, all that's needed is to add PATH_INFO to the list of environment variables not converted (as in Bug 9223) (Not sure where that code is, can someone point me to it?). However, I think that will cause future issues with other environment variables that may even be application-specific. Re-assigned to Will Rowe based on similarity to Bug 9223, hope that's OK.
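As a rough illustration of the difference between the two modes discussed above, the following Python sketch builds a Win32-style environment block (NUL-terminated NAME=value entries followed by one extra terminator) either as raw bytes or as UTF-16LE. It is not the code in proc.c; build_env_block and its wide flag are hypothetical, and the flag merely plays the role of the ANSI versus Unicode choice.

```python
def build_env_block(env: dict, wide: bool) -> bytes:
    """Build an environment block from a {name: raw value bytes} mapping.

    wide=False: values are copied byte-for-byte (the 'no conversion' mode).
    wide=True:  each entry is decoded as strict UTF-8 and re-encoded as
                UTF-16LE, so a value that is not valid UTF-8 raises
                UnicodeDecodeError, mirroring the reported failure.
    """
    nul = b"\x00\x00" if wide else b"\x00"
    block = b""
    for name in sorted(env):  # sorted only to keep the output deterministic
        entry = name.encode("ascii") + b"=" + env[name]
        if wide:
            entry = entry.decode("utf-8").encode("utf-16-le")
        block += entry + nul
    return block + nul  # extra terminator marks the end of the block


env = {"PATH_INFO": b"/Main/FromageD\xe9rap\xe9"}
# Raw mode passes the ISO-8859-1 bytes through untouched.
assert build_env_block(env, wide=False) == b"PATH_INFO=/Main/FromageD\xe9rap\xe9\x00\x00"
```

Calling build_env_block(env, wide=True) on the same input raises UnicodeDecodeError, which corresponds to the point where the server currently gives up with a 500.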
No, that's not OK. Then he's the only one getting mail on updates. I'm adding him to CC instead. Anyway, I'm wondering what the FS type has to do with environment conversion. (But I don't have much win32 fu.)
Created attachment 13812 [details] Simple patch to prevent converting PATH_INFO to UCS-2 This is a very simple patch to modules/arch/win32/mod_win32.c that should prevent the PATH_INFO environment variable being converted from UTF-8 to UCS-2, as with QUERY_STRING (Bug 9223). I have not been able to test this yet - I downloaded various free Microsoft compilers and am almost ready, but missing MSDEV (any pointers to free versions, please email me). I'm on vacation until near end of the year, but if someone with a Windows build environment can test it with the above URL (omit the bin/view part if necessary) that would be great. I still think a wider patch would be better but this may fix the immediate issue.
Hi, I'm not using TWiki and I've run into the same problem: when I use PHP's URL encoding function to access files with international characters, a 404 file not found is returned. When I first UTF-8 encode the URL and then URL encode it, it works fine. Note the encoding differences in the URL: R%EAve and R%C3%AAve. Examples from log file:
xxx.xxx.xxx.xxx - - [01/Jan/2005:18:23:03 +0100] "GET /Idir/Deux%20Rives%2C%20un%20R%EAve/01%20-%20Pourquoi%20cette%20pluie%20%20.mp3 HTTP/1.0" 404 260 "-" "WinampMPEG/2.9" "-"
xxx.xxx.xxx.xxx - - [01/Jan/2005:21:12:16 +0100] "GET /Idir/Deux%20Rives%2C%20un%20R%C3%AAve/01%20-%20Pourquoi%20cette%20pluie%20%20.mp3 HTTP/1.0" 200 8164000 "-" "WinampMPEG/5.0" "-"
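The two encodings seen in the log can be reproduced with a few lines of Python (used here purely for illustration; the report itself concerns PHP). The only assumption is that the directory name contains the word "Rêve".

```python
from urllib.parse import quote

name = "R\u00eave"  # "Rêve"

# Percent-encoding the ISO-8859-1 bytes gives the form that returned 404.
legacy = quote(name, encoding="iso-8859-1")
# Percent-encoding the UTF-8 bytes gives the form that returned 200.
conformant = quote(name, encoding="utf-8")

assert legacy == "R%EAve"
assert conformant == "R%C3%AAve"
```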
Can someone with an Apache for Windows development environment please consider testing my one-line patch (see attachment)? The test case is very simple so this should not take too long... I can't justify paying for a copy of Visual Studio just to get this tested, and without MSDEV the free version of Visual C++ doesn't work for Apache builds. Alternatively, if anyone has pointers to using Cygwin + mingw to do a Win32 build of Apache, that would be great, as I'm quite familiar with Cygwin already and it won't cost anything to test this.
This is a sane proposal. I'll attach a replacement libhttpd.dll (2.0.53-dev) to the incident once I have a chance to commit the fix to 2.1-dev.
Francis, "When using PHP's URL encoding function to access files" is unrelated to this incident (presuming you are using mod_php4 à la the php4apache2handler module).
Created attachment 13929 [details] libhttpd.dll 2.0.53-dev svn rev 124556 To test this patch, replace the existing libhttpd.dll in the Apache/bin directory of an httpd-2.0 installation, version 2.0.44 or later.
Hi, Maybe this comment does not belong here, but I'll try to explain more clearly. I used PHP to read the contents of a directory and to output the URL. When I only URL encoded the result, the request returned a 404 not found. When I UTF-8 encoded it first and then URL encoded it, the file was found. It might not be related - but then again - it might be.
I tried the new libhttpd.dll under my Apache setup and I got a somewhat different conversion error: (22)I(22)Invalid argument: utf8 to ucs2 conversion failed on this string: PATH_TRANSLATED=C:\\apachefriends\\xampp\\htdocs\\Main\\FromageD\xe9rap\xe9 This is happening later in the process as you can see, on the file not the URL as in the original bug. The URL used was http://localhost:8080/cgi-bin/view/Main/FromageD%E9rap%E9 as in the bug report. I know this is not a server or application config issue because http://localhost:8080/cgi-bin/view/Main/WebHome works fine even though C:\\apachefriends\\xampp\\htdocs\\Main\\WebHome does not exist as a file either. So it seems there is a file-existence check somewhere that is also requiring a translation of PATH_INFO even though the file doesn't have to exist.
Created attachment 13952 [details] Revised patch to avoid PATH_INFO and PATH_TRANSLATED conversions New patch that should work better... The analysis I submitted earlier today was wrong - it is the same issue as before with slightly different error message. This revised patch adds PATH_TRANSLATED to the list of environment variables not converted to UCS2 (note that REQUEST_URI also includes PATH_INFO data but is already covered by current code).
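The effect of such an exemption list can be sketched as follows. This is an illustrative Python model, not the code in mod_win32.c: the names in EXEMPT_NAMES are simply those mentioned in this report (QUERY_STRING from Bug 9223, REQUEST_URI as already covered, plus the two added by the patch), and failing_variables is a hypothetical helper that reports which variables would still trip the UTF-8 conversion.

```python
# Names taken from the comments in this report; the real list may differ.
EXEMPT_NAMES = frozenset(
    {"QUERY_STRING", "REQUEST_URI", "PATH_INFO", "PATH_TRANSLATED"}
)


def failing_variables(env: dict) -> list:
    """Return the names of non-exempt variables whose values are not valid UTF-8."""
    failing = []
    for name, value in env.items():
        if name in EXEMPT_NAMES:
            continue  # passed through without conversion
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            failing.append(name)
    return sorted(failing)


env = {
    "PATH_INFO": b"/Main/FromageD\xe9rap\xe9",
    "PATH_TRANSLATED": b"C:\\htdocs\\Main\\FromageD\xe9rap\xe9",
    "SERVER_NAME": b"localhost",
}
assert failing_variables(env) == []
```

A per-name list like this only protects the variables it names, so any other variable carrying the same bytes would still need to be added separately.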
Thanks Richard, the patch is now in the httpd-2.1 tree and should soon be backported (provided we get the votes) to 2.0.
*** Bug 33055 has been marked as a duplicate of this bug. ***
This bug also occurs with REDIRECT_URL, which seems to have gone unnoticed in the 2.0.54 release!
Bug 34985 is another variant of this, have commented there.
*** This bug has been marked as a duplicate of 13029 ***