MyFaces Core
MYFACES-1396

International characters are not properly encoded to Mnemonic/Numeric values (Was: Too much escaping)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.5-SNAPSHOT
    • Fix Version/s: None
    • Component/s: General
    • Labels:
      None

      Description

      HTMLOutputText (which delegates to HTMLEncoder) escapes not only XML-invalid characters (like <, >, &), but also German umlauts. This is OK when generating (X)HTML, but not when generating XML. Moreover, according to the official documentation for the outputText tag, German umlauts should not be escaped at all: If the "escape" attribute is not present, or it is present and its value is "true" all angle brackets should be converted to the ampersand xx semicolon syntax when rendering the value of the "value" attribute as the value of the component.
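For illustration, the behaviour the reporter asks for (escape only markup-significant characters, leave umlauts and other non-ASCII text alone) could look roughly like the sketch below. The class name `XmlTextEscaper` is hypothetical and is not part of MyFaces; this is not the HTMLEncoder implementation.

```java
/**
 * Hypothetical helper (not MyFaces code): escapes only the characters that
 * are significant in XML markup and passes everything else through.
 */
public final class XmlTextEscaper {

    private XmlTextEscaper() {
    }

    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '&':
                    out.append("&amp;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    // Non-ASCII characters such as umlauts are left untouched.
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this approach the output is only correct if the response is actually delivered in a charset that can represent the unescaped characters, for example UTF-8.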

      There is automatic XML detection, but it is broken: only a few predefined MIME types are recognized (application/xhtml+xml, application/xml, text/xml).
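A more tolerant detection could treat any media type with the "+xml" suffix as XML, which would cover image/svg+xml. The following is a minimal sketch under that assumption; `XmlContentTypes` and `isXml` are hypothetical names, not MyFaces API.

```java
import java.util.Locale;

/** Hypothetical helper (not MyFaces code) that recognizes XML media types. */
public final class XmlContentTypes {

    private XmlContentTypes() {
    }

    public static boolean isXml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mime.equals("text/xml")
                || mime.equals("application/xml")
                || mime.endsWith("+xml");
    }
}
```

Under this sketch, `isXml("image/svg+xml")` and `isXml("application/xhtml+xml; charset=UTF-8")` are both true, while `isXml("text/html")` is false.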

      This bug prevents using JSF for generating other content (e.g. SVG, MIME-type image/svg+xml).

      1. test.jsf
        0.3 kB
        Paul Pogonyshev

        Activity

        Mario Ivankovits added a comment -

        Is it possible to deliver the content as UTF-8? As far as I remember, no escaping of German umlauts takes place then.

        Ciao,
        Mario

        Tomas Fischer added a comment -

        You are right that the UnicodeEncoder escapes all characters >= 0x80 as numeric XML entities &#xxx; (however, it doesn't encode <, > and &, which is probably another bug), but this encoder is used only inside script or style elements (see HtmlResponseWriterImpl). Otherwise the HTMLEncoder is used.
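The behaviour described here can be pictured with the following sketch. It is not the MyFaces UnicodeEncoder; `NumericEntityEncoder` is a hypothetical stand-in that mirrors the description, including the fact that <, > and & are left alone.

```java
/**
 * Hypothetical illustration (not the MyFaces UnicodeEncoder): replaces every
 * code point >= 0x80 with a decimal numeric entity and copies ASCII as-is,
 * so markup characters are deliberately NOT escaped here.
 */
public final class NumericEntityEncoder {

    private NumericEntityEncoder() {
    }

    public static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint >= 0x80) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars depending on surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```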

        Paul Pogonyshev added a comment -

        I also stumbled into this bug. Besides XML, it also makes it very difficult to use generated strings in JavaScript, since JavaScript does not convert entities to characters automatically (you have to do it manually, which is very inconvenient to do for each string, not to mention that it is improper.)

        Please don't escape encodable characters by default or add an option to not escape them somewhere in the configuration file.

        I attach a simple test page with Cyrillic characters generated in various ways. All are converted to entities here (1.1.4.)
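One way to read the request above is: escape a character as a numeric entity only when the response charset cannot represent it. A minimal sketch of that idea, using the standard `CharsetEncoder.canEncode` check, follows; the class `CharsetAwareEscaper` is hypothetical and is not an existing MyFaces option.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/**
 * Hypothetical helper (not MyFaces code): escapes markup characters always,
 * and other characters only if the target charset cannot encode them.
 */
public final class CharsetAwareEscaper {

    private CharsetAwareEscaper() {
    }

    public static String escape(String text, Charset responseCharset) {
        CharsetEncoder encoder = responseCharset.newEncoder();
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String unit = new String(Character.toChars(codePoint));
            if (codePoint == '<') {
                out.append("&lt;");
            } else if (codePoint == '>') {
                out.append("&gt;");
            } else if (codePoint == '&') {
                out.append("&amp;");
            } else if (encoder.canEncode(unit)) {
                // Encodable in the response charset: emit the character itself.
                out.append(unit);
            } else {
                // Not encodable: fall back to a numeric entity.
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

With a UTF-8 response, Cyrillic text passes through unchanged; with US-ASCII it falls back to numeric entities.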

        Paul Pogonyshev added a comment -

        Test case with Cyrillic letters converted to entities.

        Paul Pogonyshev added a comment -

        Also, I took all suggested measures to generate UTF-8 contents, but this didn't help.

        Paul Pogonyshev added a comment -

        I have fixed this locally since we have a release soon and it is not fixed in upstream (stable) versions.

        I spent 1.5 days debugging this. It all came down to the UTF-8 charset suddenly being spelled as "UTF8" (while it was spelled with a hyphen at earlier stages.) I have no idea why the change happened; I just fixed it by converting "UTF8" to "UTF-8" in the HtmlResponseWriterImpl class.

        BTW, you have a really weird package structure. For some reason, there is a shared_tomahawk package, but it seems identical to shared_impl and even not present in the sources. This made my debugging three times longer than it could be...

        Paul Spencer added a comment -

        Paul,
        Please submit a patch or describe what you mean by "I just fixed it by converting "UTF8" to "UTF-8" in HtmlResponseWriterImpl class.". I just checked org.apache.myfaces.shared.renderkit.html.HtmlResponseWriterImpl in the shared project. It has been UTF-8 for a while, so I am not sure what you fixed.

        Paul Spencer

        Paul Pogonyshev added a comment -

        I didn't create a patch since I didn't feel it was a proper fix for the upstream version. And now I can't create one since the SVN checkout commands on your site are broken.

        Anyway, let me describe it in more words:

        • when an instance of org.apache.myfaces.shared_tomahawk.renderkit.html.HtmlResponseWriterImpl is created (note: shared_tomahawk, not shared_impl!), it is passed "UTF8", without a hyphen, as `characterEncoding';
        • in all (or at least all relevant) cases before that, the charset is "UTF-8", with a hyphen, as expected; in particular this is true for org.apache.myfaces.shared_tomahawk.renderkit.html.HtmlResponseWriterImpl (note: shared_impl, not shared_tomahawk);
        • I fixed it by converting the "UTF8" string to "UTF-8" in the HtmlResponseWriterImpl constructor;
        • a proper fix would be to find out why the charset becomes "UTF8", without a hyphen, in the first place; the ad-hoc fix above could be included too, as a way to make HtmlResponseWriterImpl more robust.
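The robustness idea in the last bullet can be sketched without hard-coding the "UTF8" spelling: `java.nio.charset.Charset.forName` accepts aliases, and `Charset.name()` returns the canonical name. The helper below is a hypothetical illustration written for a current JDK (it uses multi-catch); it is not the local fix described above, whose code was never attached.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper (not MyFaces code) that canonicalizes charset names. */
public final class CharsetNames {

    private CharsetNames() {
    }

    /**
     * Returns the canonical name for a charset alias, for example "UTF-8"
     * for "UTF8". Unknown or malformed names are returned unchanged.
     */
    public static String canonicalName(String characterEncoding) {
        if (characterEncoding == null) {
            return null;
        }
        try {
            return Charset.forName(characterEncoding.trim()).name();
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Leave names the JVM does not know as they are.
            return characterEncoding;
        }
    }
}
```

Comparing canonical names rather than raw strings would make a check such as "is this UTF-8?" independent of how the caller spelled the charset.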
        Paul Pogonyshev added a comment -

        Eh, I meant org.apache.myfaces.shared_impl.renderkit.html.HtmlResponseWriterImpl for the second list item. Anyway, that is mentioned in the parentheses.

        Paul Spencer added a comment -

        Paul,
        I have searched the source code for "UTF8", but found nothing. As you said, the charset is being passed into HtmlResponseWriterImpl. Have you verified whether MyFaces is getting the charset "UTF8" from your browser or from source code, including JSP?

        Paul Spencer

        Paul Pogonyshev added a comment -

        Not from the code, there's nothing like that (I actually grepped the whole source tree.)

        It may get it from the browser, I don't know. However, in this case it is very wrong: 1) at least MyFaces must handle "UTF8" just like "UTF-8"; 2) browser must not determine encoding of pages, since it is impossible to reencode pages robustly; in particular, inserting HTML entities into unsuspecting JavaScript (as in my case) will break things.

        Paul Spencer added a comment -

        Paul,
        Can you verify what your browser is sending, i.e. UTF8 or UTF-8?

        What is the browser?

        What is the default language?

        Paul Spencer

        Paul Pogonyshev added a comment -

        1) No. I'm too tired now to find how to do it. 2) Firefox 2.0. 3) English (US).

        Paul Spencer added a comment -

        I have used the tcpmon from the Apache Axis project
        http://ws.apache.org/axis/java/user-guide.html#AppendixUsingTheAxisTCPMonitorTcpmon

        Paul Spencer

        Paul Pogonyshev added a comment -

        And let me clarify why I think the browser should have absolutely no say in the resulting encoding.

        Nowadays all browsers should be able to handle any encoding just fine, as long as it is stated in the HTML page header. If a browser fails to handle a particular encoding, you should upgrade it or else throw it away. Nothing I know of can reencode pages on the fly, so MyFaces seems to be reinventing a wheel that is absolutely unneeded. In fact you seem to encourage browser behaviour which will not work with other server-side solutions, especially if a server just contains a number of static HTML pages. I can also confirm that it works perfectly without reencoding in the JSP parts of the site.

        Paul Spencer added a comment -

        Paul,
        I would like to recreate the problem so it can be addressed. There are many places where the charset can be set, including the browser and MyFaces tags. At this point I do not know where the charset is set to UTF8.

        Are you seeing UTF8 in any of the pages generated by MyFaces, and if so, where?

        Paul Spencer

        Paul Pogonyshev added a comment -

        I don't actively see it anywhere. However, I did see it in the org.apache.myfaces.shared_tomahawk.renderkit.html.HtmlResponseWriterImpl constructor, and it caused all non-ASCII characters to be replaced with HTML entities. The entities were also seen by my work neighbor and my client, but AFAIK we all use Firefox. And replacing the "UTF8" string with "UTF-8" in this function did solve the problem, so it was indeed an (indirect) cause.

        I suggest that you try the small test page I attached. You can also make it available somewhere on a (test) MyFaces server and then I can test it with my browser.

        I searched the whole source tree for "UTF8" again. It is not present in any configuration, Java or JSP/JSF files, except in Java comments on a few occasions. And in my local copy of HtmlResponseWriterImpl.java, of course.

        Paul Spencer added a comment -

        Paul,
        The JSP, test.jsp, includes "/base/taglibInclude.jsp", but that file is not attached to this issue. Is it needed for the test?

        Paul Spencer

        Tomas Fischer added a comment -

        I didn't complain that the UTF-8 characters would be passed incorrectly, I did complain that the HTMLOutputText doesn't work properly.

        Documentation states: If the "escape" attribute is not present, or it is present and its value is "true" all angle brackets should be converted to the ampersand xx semicolon syntax when rendering the value of the "value" attribute as the value of the component. If the "escape" attribute is present and is "false" the value of the component should be rendered as text without escaping.

        <h:outputText value="äüöß" escape="true" /> outputs äüöß
        <h:outputText value="äüöß" escape="false" /> outputs äüöß

        Both are incorrect according to the documentation.

        Paul Pogonyshev added a comment -

        Paul: sorry, no, it is not needed. Just accidentally left from a real page.

        Tomas: yes, I started a somewhat different discussion, but I believe your problem is caused by UTF-8 characters replaced with HTML entities due to charset being "UTF8", not "UTF-8".

        Paul Spencer added a comment -

        When I add the following to the outputText test case:
        <h:outputText id="escape" escape="true" value="10 > 5" />
        <h:outputText value=" | " />
        <h:outputText id="notEscape" escape="false" value="10 > 5" />
        <h:outputText value=" | " />
        <h:outputText id="utf8charEscaped" value="äüöß" escape="true" />
        <h:outputText value=" | " />
        <h:outputText id="utf8charNotEscaped" value="äüöß" escape="false" />
        <h:outputText value=" | " />
        <h:outputText id="utf8char" value="äüöß" />
        <h:outputText value=" | " />
        <h:outputText id="utf8charInEscapedFormat" value="äüöß" escape="false" />

        I get the following output running MyFaces 1.1.5-SNAPSHOT
        <span id="escape">10 &gt; 5</span>

        <span id="notEscape">10 > 5</span>

        <span id="utf8charEscaped">????</span>

        <span id="utf8charNotEscaped">????</span>

        <span id="utf8char">????</span>

        <span id="utf8charInEscapedFormat">äüöß</span>

        I get the following output running Sun's RI
        <span id="escape">10 &gt; 5</span>

        <span id="notEscape">10 > 5</span>

        <span id="utf8charEscaped">����</span>

        <span id="utf8charNotEscaped">????</span>

        <span id="utf8char">����</span>

        <span id="utf8charInEscapedFormat">äüöß</span>

        So I see the following problem:
        Escaping, or not defining the escape attribute, incorrectly converts international characters to their numeric or mnemonic value.

        Tomas Fischer added a comment -

        The main problem (for our project) is that the escaping occurs at all. We need exactly the described behaviour: either no escaping at all or escaping of the XML entities only.

        For generating HTML content, escaping international characters -> numeric values might be OK; for generating XML content (MIME type xxx/yyy+xml) it may be unacceptable and should be disabled. Escaping international characters -> named entities is not needed at all (if the former is available) and is dangerous, as not every browser understands every named entity.

        Paul Spencer added a comment -

        Tomas,

        I am not sure what the JSR says about escaping international characters. Martin Marinschek may be able to answer this question. We now have a test case that Martin, and the build process, can use to determine when this issue is resolved.

        Paul Spencer

        Martin Marinschek added a comment -

        Hi Paul,

        I believe that MyFaces is indeed misbehaving here - it's probably the HtmlResponseWriter which is encoding too much.

        regards,

        Martin

        Manfred Geiler added a comment -

        Martin already confirmed that MyFaces is misbehaving here, but considering that the HtmlRenderkit was not designed for generating XML output in the first place, I cannot agree that this is a "Blocker" issue. I am changing it to "Major". There should not be a 1.1.5 release delay because of this one.
        As soon as anyone provides a patch we will fix this of course.

        Nick Belaevski added a comment -

        Escaping should be done conditionally, depending on the fact whether we're outputting script/style text or not.

        E.g.:

        \u00a0 should be represented as &nbsp; for common text, but as \u00a0 for style/script tags body
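A minimal sketch of such context-dependent escaping is shown below. `ContextEscaper` and its boolean flag are hypothetical and not MyFaces API. The sketch only handles non-ASCII characters, emits a decimal numeric entity rather than a named one for ordinary text, and uses the JavaScript string escape form for script bodies; CSS has its own escape syntax, so real style handling would need a separate branch.

```java
/**
 * Hypothetical helper (not MyFaces code): picks the escape form for
 * non-ASCII characters based on where the text is written.
 */
public final class ContextEscaper {

    private ContextEscaper() {
    }

    public static String escape(String text, boolean insideScript) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else if (insideScript) {
                // JavaScript form: one backslash-u escape per UTF-16 code unit.
                for (int j = i; j < i + length; j++) {
                    out.append(String.format("\\u%04x", (int) text.charAt(j)));
                }
            } else {
                // Markup form: one decimal numeric entity per code point.
                out.append("&#").append(codePoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }
}
```

For the no-break space in the example above, this yields `&#160;` in ordinary text and the six-character backslash-u sequence inside a script body.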


          People

          • Assignee:
            Martin Marinschek
            Reporter:
            Tomas Fischer
          • Votes:
            1
            Watchers:
            3
