XalanJ2
  1. XalanJ2
  2. XALANJ-611

HTML output serializes ampersand as "&" in HREF attributes

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.7
    • Fix Version/s: 2.7.1
    • Component/s: Serialization
    • Labels:
      None
    • Environment:
      Operating System: All
      Platform: All

      Description

      Using the HTML output method, an ampersand is serialized as "&". It should
      be "&".

        Activity

        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12571622 ] jira [ 12595227 ]
        Mark Thomas made changes -
        Workflow jira [ 35515 ] Default workflow, editable Closed status [ 12571622 ]
        Hide
        Brian Minchau added a comment -

        Would the originator of this issue please verify that this issue is fixed in the 2.7.1 release, by adding a comment to this issue, so that we can close this issue.

        A lack of response by February 1, 2008 will be taken as consent that we can close this resolved issue.

        Regards,
        Brian Minchau

        Show
        Brian Minchau added a comment - Would the originator of this issue please verify that this issue is fixed in the 2.7.1 release, by adding a comment to this issue, so that we can close this issue. A lack of response by February 1, 2008 will be taken as consent that we can close this resolved issue. Regards, Brian Minchau
        Brian Minchau made changes -
        Affects Version/s 2.7 [ 11080 ]
        Affects Version/s 2.7.1 [ 10863 ]
        Hide
        Brian Minchau added a comment -

        The fixed version is marked as 2.7.1.
        The affecteced version was marked as 2.7.1, which is clearly wrong.
        Changing the affected version to 2.7, which is better, though perhaps not fully accurate.

        Show
        Brian Minchau added a comment - The fixed version is marked as 2.7.1. The affecteced version was marked as 2.7.1, which is clearly wrong. Changing the affected version to 2.7, which is better, though perhaps not fully accurate.
        Hide
        Brian Minchau added a comment -

        There is already a testcase, entref01 that tries to output this (among other things):
        <A HREF="&">

        The gold file has the above line. With the patch the output is:
        <A HREF="&">

        However the XML parser reads both attribute values in as an ampersand,
        and when the testcase runs in the harness it still passes.

        Show
        Brian Minchau added a comment - There is already a testcase, entref01 that tries to output this (among other things): <A HREF="&"> The gold file has the above line. With the patch the output is: <A HREF="&"> However the XML parser reads both attribute values in as an ampersand, and when the testcase runs in the harness it still passes.
        Brian Minchau made changes -
        Resolution Fixed [ 1 ]
        Fix Version/s Latest Development Code [ 10863 ]
        Status Reopened [ 4 ] Resolved [ 5 ]
        Hide
        Brian Minchau added a comment -

        Fixed, the patch was applied to the latest development code.

        Show
        Brian Minchau added a comment - Fixed, the patch was applied to the latest development code.
        Hide
        Yash Talwar added a comment -

        I have reviewed the changes made in code. The changes look good to me.
        I approve.

        Yash.

        Show
        Yash Talwar added a comment - I have reviewed the changes made in code. The changes look good to me. I approve. Yash.
        Brian Minchau made changes -
        Environment Operating System: All
        Platform: All
        Operating System: All
        Platform: All
        Xalan info [PatchAvailable]
        Bugzilla Id 4328
        Description Using the HTML output method, an ampersand is serialized as "&". It should
        be "&amp;".
        Using the HTML output method, an ampersand is serialized as "&". It should
        be "&amp;".
        Brian Minchau made changes -
        Assignee Xalan Developers Mailing List [ xalan-dev@xml.apache.org ] Brian Minchau [ minchau@ca.ibm.com ]
        Brian Minchau made changes -
        Attachment patch.to.ToHTMLStream.txt [ 12323904 ]
        Hide
        Brian Minchau added a comment -

        Attaching a patch to ToHTMLStream.java that modifies its writeAttrURI() method to special case a literal '&' just like it special cases a quotation character.

        The patch writes out "&" for a literal '&' in an HTML URI attribute value.

        Henry Zongaro's sample XSL is essentially this:
        <?xml version="1.0"?>
        <xsl:stylesheet
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="1.0">
        <xsl:template match="/">
        <HTML><a href="http://example.org/part&amp;parcel"/></HTML>
        </xsl:template>
        </xsl:stylesheet>

        The original output is:
        <HTML>
        <a href="http://example.org/part&parcel"></a>
        </HTML>

        With the patch the output is:
        <HTML>
        <a href="http://example.org/part&amp;parcel"></a>
        </HTML>

        Show
        Brian Minchau added a comment - Attaching a patch to ToHTMLStream.java that modifies its writeAttrURI() method to special case a literal '&' just like it special cases a quotation character. The patch writes out "&" for a literal '&' in an HTML URI attribute value. Henry Zongaro's sample XSL is essentially this: <?xml version="1.0"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> <xsl:template match="/"> <HTML><a href="http://example.org/part&amp;parcel"/></HTML> </xsl:template> </xsl:stylesheet> The original output is: <HTML> <a href="http://example.org/part&parcel"></a> </HTML> With the patch the output is: <HTML> <a href="http://example.org/part&amp;parcel"></a> </HTML>
        Hide
        Nick Fitzsimons added a comment -

        Thanks for the feedback, Henry.

        Although it is perhaps presumptious of me to second-guess the deliberations of the relevant W3C WG, I think it's reasonable to conjecture that the use of "should" as opposed to "must" (in the RFC 2119 sense) rwas prompted by the fact that existing User Agents tolerate the unescaped ampersands in the "href" attribute of the "a" element, and a number of web page creation tools, at the time HTML 4.01 was written and probably to this day, would produce such markup. "Must" was probably seen as imposing too rigid a constraint, given that the then-forthcoming formulation of XHTML 1.0 would impose stricter requirements. However, the intention (I believe) was to promote correct usage from that time onwards, which means Xalan really should (no pun intended) be producing & in this context.

        Although I haven't yet read the source, I would assume (given that other attributes don't suffer from this) that the code is treating "href" as a special case, implying that the fix might be relatively straightforward (not that fixes ever turn out to be as straightforward as one imagines). If I get time over the next couple of days I'll start getting to know the source and see if I can either identify a fix, or at least provide some pointers to those with more experience of the codebase.

        Show
        Nick Fitzsimons added a comment - Thanks for the feedback, Henry. Although it is perhaps presumptious of me to second-guess the deliberations of the relevant W3C WG, I think it's reasonable to conjecture that the use of "should" as opposed to "must" (in the RFC 2119 sense) rwas prompted by the fact that existing User Agents tolerate the unescaped ampersands in the "href" attribute of the "a" element, and a number of web page creation tools, at the time HTML 4.01 was written and probably to this day, would produce such markup. "Must" was probably seen as imposing too rigid a constraint, given that the then-forthcoming formulation of XHTML 1.0 would impose stricter requirements. However, the intention (I believe) was to promote correct usage from that time onwards, which means Xalan really should (no pun intended) be producing & in this context. Although I haven't yet read the source, I would assume (given that other attributes don't suffer from this) that the code is treating "href" as a special case, implying that the fix might be relatively straightforward (not that fixes ever turn out to be as straightforward as one imagines). If I get time over the next couple of days I'll start getting to know the source and see if I can either identify a fix, or at least provide some pointers to those with more experience of the codebase.
        Hide
        Henry Zongaro added a comment -

        The relevant text in HTML 4.01 reads, "Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values."

        The word "should" is defined by RFC 2119 to mean "that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course." So it's not an absolute requirement that ampersand be emitted as the character entity reference &.

        However, if you put this example literal result element in a stylesheet

        <a href="http://example.org/part&amp;parcel"/>

        you'll see this result

        <a href="http://example.org/part&parcel"/>

        which is clearly not what was intended.

        So it seems the full implications were not understood or not carefully weighed before choosing a different course.

        I've asked Brian Minchau to put this on the agenda for the next Xalan-J bug triage call.

        Show
        Henry Zongaro added a comment - The relevant text in HTML 4.01 reads, "Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values." The word "should" is defined by RFC 2119 to mean "that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course." So it's not an absolute requirement that ampersand be emitted as the character entity reference &. However, if you put this example literal result element in a stylesheet <a href="http://example.org/part&amp;parcel"/> you'll see this result <a href="http://example.org/part&parcel"/> which is clearly not what was intended. So it seems the full implications were not understood or not carefully weighed before choosing a different course. I've asked Brian Minchau to put this on the agenda for the next Xalan-J bug triage call.
        Hide
        Nick Fitzsimons added a comment -

        This is a serious flaw in Xalan, and I can't believe it has been around since 2001. The simple fact is that Xalan, when used with output method HTML, fails to produce valid HTML, and therefore fails to correctly implement XSLT 1.0. I'm not sure why the matter was "extensively debated" as it takes about two minutes reading the XSLT 1.0 and HTML 4.01specs to determine that:

        1. The XSLT 1.0 spec is rather vague about outputting character entities in HTML [XSLT 1.0 section 16.2];
        2. Nonetheless, it clearly states that the processor should produce valid HTML, defaulting to HTML 4 if no version is specified [XSLT 1.0 section 16.2];
        3. Unescaped ampersands in href attributes are not valid HTML 4 (or 4.01) [HTML 4.0 section 5.3.2] and [HTML 4.01 section 5.3.2].

        Is there likely to be any progress on this issue soon? As was noted in 2002, Saxon and MSXML both manage to work correctly on this one.

        Show
        Nick Fitzsimons added a comment - This is a serious flaw in Xalan, and I can't believe it has been around since 2001. The simple fact is that Xalan, when used with output method HTML, fails to produce valid HTML, and therefore fails to correctly implement XSLT 1.0. I'm not sure why the matter was "extensively debated" as it takes about two minutes reading the XSLT 1.0 and HTML 4.01specs to determine that: 1. The XSLT 1.0 spec is rather vague about outputting character entities in HTML [XSLT 1.0 section 16.2] ; 2. Nonetheless, it clearly states that the processor should produce valid HTML, defaulting to HTML 4 if no version is specified [XSLT 1.0 section 16.2] ; 3. Unescaped ampersands in href attributes are not valid HTML 4 (or 4.01) [HTML 4.0 section 5.3.2] and [HTML 4.01 section 5.3.2] . Is there likely to be any progress on this issue soon? As was noted in 2002, Saxon and MSXML both manage to work correctly on this one.
        Brian Minchau made changes -
        Affects Version/s 2.2.x [ 10862 ]
        Affects Version/s Latest Development Code [ 10863 ]
        Henry Zongaro made changes -
        Priority Major [ 3 ]
        Serge Knystautas made changes -
        Field Original Value New Value
        issue.field.bugzillaimportkey 4328 25520
        Hide
        Owen Rees added a comment -

        To me, the "The html output method should not escape a & character occurring in
        an attribute value immediately followed by a { character" in section 16.1 of the
        XSLT 1.0 spec implies that & should otherwise be escaped to its character entity
        reference. This is consistent with the HTML spec. The later XSLT specs I have
        looked at (and the new "XSLT 2.0 and XQuery 1.0 Serialization" draft to which
        the relevant section has been moved) also use the same words.

        Since & is an ascii character, the special handling on non-ASCII in URI
        attribute values does not apply.

        As far as I can see this is a bug with respect to XSLT 1.0 and its succcessors,
        as well as causing invalid HTML to be generated.

        Show
        Owen Rees added a comment - To me, the "The html output method should not escape a & character occurring in an attribute value immediately followed by a { character" in section 16.1 of the XSLT 1.0 spec implies that & should otherwise be escaped to its character entity reference. This is consistent with the HTML spec. The later XSLT specs I have looked at (and the new "XSLT 2.0 and XQuery 1.0 Serialization" draft to which the relevant section has been moved) also use the same words. Since & is an ascii character, the special handling on non-ASCII in URI attribute values does not apply. As far as I can see this is a bug with respect to XSLT 1.0 and its succcessors, as well as causing invalid HTML to be generated.
        Hide
        Stuart Roebuck added a comment -

        Has there been any progress with this bug, it effectively means that any web site produced using Xalan and containing multiple parameters in links will fail HTML checks?

        From the previous discussion is it clear to everyone that this is a bug?

        Show
        Stuart Roebuck added a comment - Has there been any progress with this bug, it effectively means that any web site produced using Xalan and containing multiple parameters in links will fail HTML checks? From the previous discussion is it clear to everyone that this is a bug?
        Hide
        julian.reschke added a comment -

        >I gather that you're referring to ampersands inside URI attribute values, since
        >we already emit & in regular text. Further, I gather that you're not

        Yes. I was talking about HREF attributes when using the HTML output method.

        > referring to the special treatment of the &{ sequence discussed in XSLT 16.2.

        Correct.

        > Are you referring to HTML 4.0 or an earlier version? Checking subsection B.2.2

        No particular version.

        > of HTML 4.01, I see they recommend & or & in URI attribute values.
        > (%46 should not be used!)

        Yes. In particular, serializing it as simple "&" is wrong. That's what this bug
        report is about.

        Show
        julian.reschke added a comment - >I gather that you're referring to ampersands inside URI attribute values, since >we already emit & in regular text. Further, I gather that you're not Yes. I was talking about HREF attributes when using the HTML output method. > referring to the special treatment of the &{ sequence discussed in XSLT 16.2. Correct. > Are you referring to HTML 4.0 or an earlier version? Checking subsection B.2.2 No particular version. > of HTML 4.01, I see they recommend & or & in URI attribute values. > (%46 should not be used!) Yes. In particular, serializing it as simple "&" is wrong. That's what this bug report is about.
        Hide
        david_marston added a comment -

        I gather that you're referring to ampersands inside URI attribute values, since
        we already emit & in regular text. Further, I gather that you're not
        referring to the special treatment of the &{ sequence discussed in XSLT 16.2.

        Are you referring to HTML 4.0 or an earlier version? Checking subsection B.2.2
        of HTML 4.01, I see they recommend & or & in URI attribute values.
        (%46 should not be used!)

        Show
        david_marston added a comment - I gather that you're referring to ampersands inside URI attribute values, since we already emit & in regular text. Further, I gather that you're not referring to the special treatment of the &{ sequence discussed in XSLT 16.2. Are you referring to HTML 4.0 or an earlier version? Checking subsection B.2.2 of HTML 4.01, I see they recommend & or & in URI attribute values. (%46 should not be used!)
        Hide
        julian.reschke added a comment -

        XSLT specifies that the "html" output method must create valid HTML (as per W3C
        spec). The HTML spec says that ampersands need to be escaped as &-a-m-p-;
        unless when appearing in CDATA (that is PRE or SCRIPT).

        So the answers are "no" and "yes".

        BTW: you might also compare with the output generated by Saxon or MSXML.

        Show
        julian.reschke added a comment - XSLT specifies that the "html" output method must create valid HTML (as per W3C spec). The HTML spec says that ampersands need to be escaped as &-a-m-p-; unless when appearing in CDATA (that is PRE or SCRIPT). So the answers are "no" and "yes". BTW: you might also compare with the output generated by Saxon or MSXML.
        Hide
        Joe Kesselman added a comment -

        Question is, independent of what HTML would like to see, are we generating
        correct results per the XSLT spec – and would we still be doing so if we
        changed it?

        If the answers are yes and no respectively, we should probably be reporting the
        issue to the XSL Working Group rather than implementing a short-term nonportable
        workaround.

        Show
        Joe Kesselman added a comment - Question is, independent of what HTML would like to see, are we generating correct results per the XSLT spec – and would we still be doing so if we changed it? If the answers are yes and no respectively, we should probably be reporting the issue to the XSL Working Group rather than implementing a short-term nonportable workaround.
        Hide
        julian.reschke added a comment -

        I don't understand the "wontfix". It's clearly wrong. Check the HTML spec.

        Show
        julian.reschke added a comment - I don't understand the "wontfix". It's clearly wrong. Check the HTML spec.
        Hide
        david_marston added a comment -

        This has been debated extensively here. We settled on a short-term action,
        and it may take improvements in the XSLT 2.0 spec to achieve flexibility.
        (But you are welcome to post an enhancement request for an option that controls
        serialization of certain HTML attributes.)

        Show
        david_marston added a comment - This has been debated extensively here. We settled on a short-term action, and it may take improvements in the XSLT 2.0 spec to achieve flexibility. (But you are welcome to post an enhancement request for an option that controls serialization of certain HTML attributes.)
        julian.reschke created issue -

          People

          • Assignee:
            Brian Minchau
            Reporter:
            julian.reschke
          • Votes:
            5 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development