|
I don't understand the "wontfix". It's clearly wrong. Check the HTML spec.
Question is, independent of what HTML would like to see, are we generating
correct results per the XSLT spec -- and would we still be doing so if we changed it? If the answers are yes and no respectively, we should probably be reporting the issue to the XSL Working Group rather than implementing a short-term nonportable workaround. XSLT specifies that the "html" output method must create valid HTML (as per W3C
spec). The HTML spec says that ampersands need to be escaped as &-a-m-p-; unless when appearing in CDATA (that is PRE or SCRIPT). So the answers are "no" and "yes". BTW: you might also compare with the output generated by Saxon or MSXML. I gather that you're referring to ampersands inside URI attribute values, since
we already emit & in regular text. Further, I gather that you're not referring to the special treatment of the &{ sequence discussed in XSLT 16.2. Are you referring to HTML 4.0 or an earlier version? Checking subsection B.2.2 of HTML 4.01, I see they recommend & or & in URI attribute values. (%46 should not be used!) >I gather that you're referring to ampersands inside URI attribute values, since
>we already emit & in regular text. Further, I gather that you're not Yes. I was talking about HREF attributes when using the HTML output method. > referring to the special treatment of the &{ sequence discussed in XSLT 16.2. Correct. > Are you referring to HTML 4.0 or an earlier version? Checking subsection B.2.2 No particular version. > of HTML 4.01, I see they recommend & or & in URI attribute values. > (%46 should not be used!) Yes. In particular, serializing it as simple "&" is wrong. That's what this bug report is about. Has there been any progress with this bug, it effectively means that any web site produced using Xalan and containing multiple parameters in links will fail HTML checks?
From the previous discussion is it clear to everyone that this is a bug? To me, the "The html output method should not escape a & character occurring in
an attribute value immediately followed by a { character" in section 16.1 of the XSLT 1.0 spec implies that & should otherwise be escaped to its character entity reference. This is consistent with the HTML spec. The later XSLT specs I have looked at (and the new "XSLT 2.0 and XQuery 1.0 Serialization" draft to which the relevant section has been moved) also use the same words. Since & is an ascii character, the special handling on non-ASCII in URI attribute values does not apply. As far as I can see this is a bug with respect to XSLT 1.0 and its succcessors, as well as causing invalid HTML to be generated. This is a serious flaw in Xalan, and I can't believe it has been around since 2001. The simple fact is that Xalan, when used with output method HTML, fails to produce valid HTML, and therefore fails to correctly implement XSLT 1.0. I'm not sure why the matter was "extensively debated" as it takes about two minutes reading the XSLT 1.0 and HTML 4.01specs to determine that:
1. The XSLT 1.0 spec is rather vague about outputting character entities in HTML [XSLT 1.0 section 16.2]; 2. Nonetheless, it clearly states that the processor should produce valid HTML, defaulting to HTML 4 if no version is specified [XSLT 1.0 section 16.2]; 3. Unescaped ampersands in href attributes are not valid HTML 4 (or 4.01) [HTML 4.0 section 5.3.2] and [HTML 4.01 section 5.3.2]. Is there likely to be any progress on this issue soon? As was noted in 2002, Saxon and MSXML both manage to work correctly on this one. The relevant text in HTML 4.01 reads, "Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values."
The word "should" is defined by RFC 2119 to mean "that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course." So it's not an absolute requirement that ampersand be emitted as the character entity reference &. However, if you put this example literal result element in a stylesheet <a href="http://example.org/part&amp;parcel"/> you'll see this result <a href="http://example.org/part&parcel"/> which is clearly not what was intended. So it seems the full implications were not understood or not carefully weighed before choosing a different course. I've asked Brian Minchau to put this on the agenda for the next Xalan-J bug triage call. Thanks for the feedback, Henry.
Although it is perhaps presumptious of me to second-guess the deliberations of the relevant W3C WG, I think it's reasonable to conjecture that the use of "should" as opposed to "must" (in the RFC 2119 sense) rwas prompted by the fact that existing User Agents tolerate the unescaped ampersands in the "href" attribute of the "a" element, and a number of web page creation tools, at the time HTML 4.01 was written and probably to this day, would produce such markup. "Must" was probably seen as imposing too rigid a constraint, given that the then-forthcoming formulation of XHTML 1.0 would impose stricter requirements. However, the intention (I believe) was to promote correct usage from that time onwards, which means Xalan really should (no pun intended) be producing & in this context. Although I haven't yet read the source, I would assume (given that other attributes don't suffer from this) that the code is treating "href" as a special case, implying that the fix might be relatively straightforward (not that fixes ever turn out to be as straightforward as one imagines). If I get time over the next couple of days I'll start getting to know the source and see if I can either identify a fix, or at least provide some pointers to those with more experience of the codebase. Attaching a patch to ToHTMLStream.java that modifies its writeAttrURI() method to special case a literal '&' just like it special cases a quotation character.
The patch writes out "&" for a literal '&' in an HTML URI attribute value. Henry Zongaro's sample XSL is essentially this: <?xml version="1.0"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> <xsl:template match="/"> <HTML><a href="http://example.org/part&amp;parcel"/></HTML> </xsl:template> </xsl:stylesheet> The original output is: <HTML> <a href="http://example.org/part&parcel"></a> </HTML> With the patch the output is: <HTML> <a href="http://example.org/part&amp;parcel"></a> </HTML> I have reviewed the changes made in code. The changes look good to me.
I approve. Yash. Fixed, the patch was applied to the latest development code.
There is already a testcase, entref01 that tries to output this (among other things):
<A HREF="&"> The gold file has the above line. With the patch the output is: <A HREF="&"> However the XML parser reads both attribute values in as an ampersand, and when the testcase runs in the harness it still passes. The fixed version is marked as 2.7.1.
The affecteced version was marked as 2.7.1, which is clearly wrong. Changing the affected version to 2.7, which is better, though perhaps not fully accurate. Would the originator of this issue please verify that this issue is fixed in the 2.7.1 release, by adding a comment to this issue, so that we can close this issue.
A lack of response by February 1, 2008 will be taken as consent that we can close this resolved issue. Regards, Brian Minchau |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
and it may take improvements in the XSLT 2.0 spec to achieve flexibility.
(But you are welcome to post an enhancement request for an option that controls
serialization of certain HTML attributes.)