Bug 10385 - SSI-Servlet produces invalid character encoding information
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 4
Classification: Unclassified
Component: Servlets:SSI
Version: 4.1.24
Hardware: All All
Importance: P3 normal with 4 votes
Target Milestone: ---
Assignee: Tomcat Developers Mailing List
URL: http://pa22.katowice.sdi.tpnet.pl:810...
Keywords:
Duplicates: 36651
Depends on:
Blocks:
 
Reported: 2002-07-01 20:54 UTC by Jos
Modified: 2005-09-19 14:44 UTC (History)

Attachments
tgzed org.apache.catalina.ssi src-package with quick fix (22.15 KB, application/octet-stream)
2003-06-27 09:15 UTC, Tomislaw Kitynski
servlets-ssi.jar bin-package with quick fix (46.83 KB, application/octet-stream)
2003-06-27 09:22 UTC, Tomislaw Kitynski

Description Jos 2002-07-01 20:54:06 UTC
On 2001-11-06 Bug 4674 was submitted

" ... the 'Content-Encoding' header SSI-Servlet generates has always the 
value 'UTF-8'. ... "


It was fixed on 2001-11-12

------- Additional Comments From Amy Roh 2001-11-12 16:36 -------
Fixed. Should be available in the next nightly build.

However, if you check the source code of this servlet in releases 4.0.3 and 
4.0.4, the code still shows:

Line 251:  res.setContentType("text/html;charset=UTF-8");
Comment 1 Remy Maucherat 2002-07-01 21:19:13 UTC
Nightlies != 4.0.x.
This bug won't be addressed in the 4.0.x releases, as they do not include the
refactored SSI code.

Please try the 4.1.6 test release instead to check the progress of the SSI code.
Comment 2 Adam W. 2003-05-27 07:09:33 UTC
I tried to use Polish character encoding on the page
http://pa22.katowice.sdi.tpnet.pl:23/fortune/example.shtml
and the result is that the characters in "Strona główna" are displayed wrong; it 
looks like they are converted to Unicode.

The charset is declared properly with
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2"> and 
the characters in the file are also OK; you can see the source of the file at
http://pa22.katowice.sdi.tpnet.pl:8101/fortune/example.shtml.src

The line with res.setContentType("text/html;charset=UTF-8"); is still present 
in the code. This is probably the reason, but simply replacing it with my 
desired encoding doesn't solve the problem completely.

What should I do to make it work, at least temporarily?
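
To illustrate the kind of mismatch described in this comment, here is a minimal sketch (not taken from the Tomcat source) of what happens when bytes written in ISO-8859-2 are decoded with a different charset. The class and method names are hypothetical and exist only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tomcat.
class CharsetMismatchDemo {

    // Encode the text with the charset the file is really stored in, then
    // decode the bytes with the charset the reader or browser assumes.
    static String reinterpret(String text, Charset actual, Charset assumed) {
        byte[] bytes = text.getBytes(actual);
        return new String(bytes, assumed);
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        // "Strona glowna" with the Polish l-with-stroke and o-acute,
        // written with escapes so the source file encoding does not matter.
        String original = "Strona g\u0142\u00f3wna";

        // Matching charsets: the text survives the round trip.
        System.out.println(reinterpret(original, latin2, latin2));
        // Bytes decoded as ISO-8859-1: the l-with-stroke turns into a superscript 3.
        System.out.println(reinterpret(original, latin2, StandardCharsets.ISO_8859_1));
        // Bytes decoded as UTF-8: the non-ASCII bytes are malformed input
        // and are replaced with U+FFFD.
        System.out.println(reinterpret(original, latin2, StandardCharsets.UTF_8));
    }
}
```

If this sketch matches the situation, changing only the charset in the Content-Type would not be enough on its own, because the reader used to load the .shtml file must also decode it with the file's real encoding.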
Comment 3 Tomislaw Kitynski 2003-06-26 15:57:06 UTC
I haven't found an existing solution to this problem, so I played a bit with 
the source and I have a working fix for it.

First of all, I am not very familiar with the procedure for applying patches to 
CVS (I mean I don't know whether I should report it before committing anything, 
ask for permission, or anything else), so I didn't put it into the repository. 
Instead I will give out the source and/or binaries if somebody asks. I'll be 
happy if the patches hit the repository anyway.

Okay, here's the trick: SSIServlet now handles two more init-parameters, 
defaultInputEncoding and defaultOutputEncoding. The first tells the SSIInclude 
command to treat all processed (and included) files as if they were written in 
this charset (by creating appropriate readers). The second sets the 
Content-Type's charset attribute to the given value and thus allows a proper 
writer to be created.

This forced me to add two methods to SSIExternalResolver interface: 
getDefaultInputEncoding and getDefaultOutputEncoding. Both return objects of 
the type java.nio.charset.Charset, that hold appropriate charsets.
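
A rough sketch of how such a configuration could be read is below. It is not the attached patch: the class SsiEncodingConfig is hypothetical, a plain Map stands in for the servlet's init-parameters so the example runs without the servlet API, and the fallback-on-bad-name behaviour is an assumption. Only the parameter names and the two getter names come from the description above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the two encodings; not the code from the attachment.
class SsiEncodingConfig {
    private final Charset inputEncoding;
    private final Charset outputEncoding;

    // initParams stands in for the servlet's init-parameters (assumption).
    SsiEncodingConfig(Map<String, String> initParams, Charset fallback) {
        this.inputEncoding = lookup(initParams.get("defaultInputEncoding"), fallback);
        this.outputEncoding = lookup(initParams.get("defaultOutputEncoding"), fallback);
    }

    // Resolve a charset name, using the fallback when the parameter is
    // missing, blank, syntactically illegal, or unknown to this JVM.
    static Charset lookup(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    Charset getDefaultInputEncoding() { return inputEncoding; }

    Charset getDefaultOutputEncoding() { return outputEncoding; }

    // Content-Type value built from the configured output charset
    // instead of a hard-coded "UTF-8".
    String contentType() {
        return "text/html;charset=" + outputEncoding.name();
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("defaultInputEncoding", "iso-8859-2");
        params.put("defaultOutputEncoding", "iso-8859-2");
        SsiEncodingConfig config = new SsiEncodingConfig(params, StandardCharsets.UTF_8);
        System.out.println(config.getDefaultInputEncoding());
        System.out.println(config.contentType());
    }
}
```

In a real servlet the resulting string would be passed to res.setContentType(...) before the writer is obtained, so that the writer uses the same charset that the header announces.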

If a certain included file happens to be in a different charset than the rest, 
its charset can be entered after the file name. I was thinking of using a 
separate parameter, but it would break the NCSA standard; besides, the 
<!--#include> command allows any number of file/virtual parameters, so it 
would have to be written like this: <!--#include file="foo.txt" 
charset="iso-8859-2" file="bar.txt" charset="iso-8859-1"--> and so on. Well, 
maybe it's not bad, but as I've written, it breaks the NCSA standard. So 
instead I've used the same syntax as in mail headers. So now we write: 
<!--#include file="foo.txt;charset=iso-8859-2" 
file="bar.txt; charset = iso-8859-1"--> and so on. I hope this will not break 
any rule, though I know it's questionable.
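
For illustration only, a value in this "file;charset=X" form could be split roughly as follows. This is a sketch under assumptions, not the code from the attachment: the class IncludeTarget and its parse method are hypothetical names, and it assumes that file names never contain a semicolon.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical value object: an include path plus the charset to read it with.
class IncludeTarget {
    final String path;
    final Charset charset;

    IncludeTarget(String path, Charset charset) {
        this.path = path;
        this.charset = charset;
    }

    // Split "name;charset=X" into its parts, tolerating spaces around ';'
    // and '='. Without a charset parameter the default is used.
    // An unknown or illegal charset name makes Charset.forName throw.
    static IncludeTarget parse(String value, Charset defaultCharset) {
        int semi = value.indexOf(';');
        if (semi < 0) {
            return new IncludeTarget(value.trim(), defaultCharset);
        }
        String path = value.substring(0, semi).trim();
        String param = value.substring(semi + 1).trim();
        int eq = param.indexOf('=');
        if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
            String name = param.substring(eq + 1).trim();
            return new IncludeTarget(path, Charset.forName(name));
        }
        // Some other parameter after the ';' is ignored in this sketch.
        return new IncludeTarget(path, defaultCharset);
    }

    public static void main(String[] args) {
        IncludeTarget a = parse("foo.txt;charset=iso-8859-2", StandardCharsets.UTF_8);
        IncludeTarget b = parse("bar.txt; charset = iso-8859-1", StandardCharsets.UTF_8);
        System.out.println(a.path + " -> " + a.charset);
        System.out.println(b.path + " -> " + b.charset);
    }
}
```

The assumption about semicolons is the weak point of this syntax: a file whose name really contains ";" could no longer be included unambiguously.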

This, however, solves my problems with incorrect output, and if we have all 
the files in the same charset, we do not have to use the "...;charset=X" 
construction (to be honest, I haven't tested the charset stuff just mentioned).

The default encodings, however, work flawlessly. If anyone is interested in 
this patch, please contact me. If the Tomcat developers find this patch useful, 
or not too dirty/nasty, then I will gladly add my .02 to the contribution.
Comment 4 Tomislaw Kitynski 2003-06-27 08:55:04 UTC
I misused WORKSFORME resolution (ehh, I should have read FM ;-), so I am 
setting the status back to REOPEN. I'm sorry, guys!
Comment 5 Tomislaw Kitynski 2003-06-27 09:15:28 UTC
Created attachment 7011
tgzed org.apache.catalina.ssi src-package with quick fix
Comment 6 Tomislaw Kitynski 2003-06-27 09:22:31 UTC
Created attachment 7012
servlets-ssi.jar bin-package with quick fix
Comment 7 Mark Thomas 2005-04-03 21:07:01 UTC
I have committed a partial fix for TC4.1.x that should resolve the original issue.

I am still looking at the attached patches.
Comment 8 Mark Thomas 2005-04-04 22:59:37 UTC
I have not committed the proposed patch as it would introduce Tomcat-specific
SSI syntax. Whilst there is no official SSI spec, I do not believe that Tomcat
should differ from the SSI syntax supported by Apache Web Server.

I have committed an alternative patch that introduces 2 new servlet parameters.
See the docs/source for details.

I have also ported the changes to TC5.5.x.
Comment 9 Mark Thomas 2005-09-19 22:44:23 UTC
*** Bug 36651 has been marked as a duplicate of this bug. ***