I can reproduce this with Xalan-J 2.5. The following comment appears in the serializer code that handles processing instructions (ToXMLStream.processingInstruction):

    // Always output a newline char if not inside of an
    // element. The whitespace is not significant in that
    // case.

That is a true statement when the output is considered to be an XML document entity, but if the output is used as an external entity referenced from within an XML document, that extra whitespace could be significant.

A variant of this issue came up in some e-mail exchanged between David Marston and Michael Kay. The Xalan-J processors always emit an EOL marker after the XML/Text declaration at the start of the output. That's not significant when the output is treated as an XML document, but it could be significant if the output is treated as an external entity referenced by an XML document.

Created an attachment (id=6005)
Input XML document

Created an attachment (id=6006)
Example stylesheet

I don't think we can ever figure out whether the document being serialized will later be used as an external entity referenced by another XML document. Of the three stream serializers, ToXMLStream, ToHTMLStream and ToTextStream, the first two will append an end-of-line (EOL) after a processing instruction if the processing instruction occurs outside of any element. Is the solution then to never output an EOL, neither when inside of an element nor outside of any elements? That would be an easy fix: just delete the few lines that output the EOL in the two processingInstruction(target,data) methods. Is that the right thing?

Hi, Brian. I believe the only time you can tell that the result really is a document entity is if it contains a DTD, but you're right that in most cases you can't tell. So, yes, the fix would be to remove the EOL emitted in the processingInstruction methods, and also after the XML declaration. Interestingly, comments don't have the same problem.

Testing note: the data for tests copy52 and copy53 could easily be expanded to have PIs outside the document element. The gold files would have to change to match, so the gold file should reflect the lack-of-newline if that's the intended gold standard.

Might want to extend output59 and output72 as well. They test xsl:processing-instruction. The former actually tests a PI outside the document element, though only for HTML.

JIRA Triage meeting Tuesday March 7, 2006 - agreed to modify the behavior to NEVER output a newline after a PI.
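To make the external-entity concern concrete, here is a minimal sketch using only the JDK's built-in DOM parser (it does not touch the Xalan serializer). It parses a host document that pulls the serialized output in through an entity reference and returns the resulting text content, so an EOL after a PI or after the text declaration can be seen in the parsed result. The class name, method name, host document and the http://example.invalid/egpe.xml system identifier are all made up for illustration; the entity content is supplied from a string by an EntityResolver, so nothing is fetched.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class ExternalEntityWhitespace {

    // Hypothetical host document that references the serializer output as &egpe;.
    static final String HOST =
        "<!DOCTYPE e [<!ENTITY egpe SYSTEM 'http://example.invalid/egpe.xml'>]>"
        + "<e>some text&egpe;more text</e>";

    /**
     * Parses HOST, substituting entityContent for the external entity,
     * and returns the text content of the <e> element.
     */
    public static String textOfHost(String entityContent) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setExpandEntityReferences(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Serve the entity from memory instead of resolving the system identifier.
        builder.setEntityResolver((publicId, systemId) ->
            new InputSource(new StringReader(entityContent)));
        return builder.parse(new InputSource(new StringReader(HOST)))
                      .getDocumentElement()
                      .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Entity content as a serializer might emit it, with and without the EOL.
        String withEol = textOfHost("<?PI-one?>\n");
        String withoutEol = textOfHost("<?PI-one?>");
        System.out.println("with EOL:    [" + withEol.replace("\n", "\\n") + "]");
        System.out.println("without EOL: [" + withoutEol.replace("\n", "\\n") + "]");
    }
}
```

The PI itself contributes nothing to the text content, so any difference between the two results comes only from the EOL that followed it.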
Assigned to Brian M.

For XML produced from a transformation one never knows how it will be used. Suppose that one produces this:

    <?xml version='1.0' encoding='UTF-8'?><?PI-one?> <!-- comment one --> <elem1><elem2>hello</elem2></elem1>

Suppose that one serialized with indent='yes'. Where could whitespace (e.g. newlines and spaces for indentation) be inserted? In this case only between <elem1> and <elem2>, or between </elem2> and </elem1>.

Adding whitespace before or after a top-level PI, comment or whitespace text node may not be correct because this XML could be used as an external general parsed entity. For example, suppose that it was referred to as &egpe; and included in other XML like this:

    <e>some text&egpe;more text</e>

In this case (and so in general) it is not correct to add whitespace to the top level of serialized XML, as that whitespace will occur after "some text" or perhaps before "more text". The producer of the serialized XML cannot know the context in which that XML will later be used. So even with indent='yes' we should not put additional whitespace between top-level nodes, not even a newline after an XML header! One can only add whitespace for indentation within an element.

In the above example indentation, if any, would be within the <elem1> and </elem1> tags, so indentation could look like this:

    <?xml version='1.0' encoding='UTF-8'?><?PI-one?> <!-- comment one --> <elem1>
       <elem2>hello</elem2>
    </elem1>

The XSLT recommendation allows modification of the result tree's content when indenting is enabled:
http://www.w3.org/TR/xslt#strip

"The xml output method should use an algorithm to output additional whitespace that ensures that the result if whitespace were to be stripped from the output using the process described in [3.4 Whitespace Stripping] with the set of whitespace-preserving elements consisting of just xsl:text would be the same when additional whitespace is output as when additional whitespace is not output. NOTE: It is usually not safe to use indent="yes" with document types that include element types with mixed content."

Attaching a patch to change ToStream.shouldIndent(). It previously returned true only if indent='yes' was specified, plus other conditions. One more condition was added: we must be inside of an element, not at a top-level node in an XML document or fragment. This has no performance impact when indent='no' (which is the default for XML).
This will have a minor performance impact for some HTML with indent='yes', which is the default, but heck, if you want it done right it usually costs something. Even for HTML, if the last thing written out was text, there is no performance impact.

A newline is no longer written out immediately after a PI since we don't know if non-whitespace text will follow the PI. Previously a newline was written out after the XML header if indent='yes'. A newline is now only written out after the header when indent='yes' and one of these:
> standalone was specified (either yes or no)
> A DOCTYPE will be written out.

Attaching a testcase, xalanj-1497.xsl, that puts comments, processing instructions and text before, in, and after the document element. It also sets indentation to 'yes' and the indentation amount (a Xalan-specific xsl:output attribute) to '3'. With the fix there is no indentation before or after the output document element.

Attaching xalanj-1497.out, the gold file for what should be output.
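The two rules described in this comment can be summarized in a small sketch. This is not the Xalan code from the patch: the class, method and parameter names are hypothetical, and the "other conditions" in the real shouldIndent() are reduced here to a single assumed check that the previous output was not text. The header rule presumably works because a standalone declaration or a DOCTYPE can only appear in a document entity, never in an external parsed entity, so a newline is known to be harmless there.

```java
public class IndentDecisionSketch {

    /**
     * Hypothetical stand-in for the patched indentation check: indent only
     * when indent='yes' was requested, we are inside an element (depth > 0),
     * and the last thing written was not text (assumed "other condition").
     */
    public static boolean shouldIndent(boolean indentRequested,
                                       int elementDepth,
                                       boolean previousWasText) {
        return indentRequested && elementDepth > 0 && !previousWasText;
    }

    /**
     * Newline after the XML header only when indent='yes' and the output is
     * known to be a document entity: standalone was specified (yes or no)
     * or a DOCTYPE will be written out.
     */
    public static boolean newlineAfterHeader(boolean indentRequested,
                                             boolean standaloneSpecified,
                                             boolean doctypeWillBeWritten) {
        return indentRequested && (standaloneSpecified || doctypeWillBeWritten);
    }

    public static void main(String[] args) {
        // Top-level node (depth 0): no indentation even with indent='yes'.
        System.out.println(shouldIndent(true, 0, false));
        // Inside the document element: indentation is allowed.
        System.out.println(shouldIndent(true, 1, false));
        // Header with no standalone and no DOCTYPE: no newline.
        System.out.println(newlineAfterHeader(true, false, false));
    }
}
```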
Attaching testcase xalanj-1497.xsl
Attaching xalanj-1497.out ... the gold file.
Attaching patch2.txt, which is a slight rework of patch.txt. Henry Zongaro found a bug during the review and this patch has that fix.

Henry found that a stylesheet like this:

    <xsl:output method="html" doctype-system='abc' />
    <xsl:template match="/">
      <xsl:comment>abc</xsl:comment>
      <html/>
    </xsl:template>

put out two DOCTYPE declarations due to a latent bug in the comment() method, which didn't do the usual cleanup of pending issues, such as closing open start-element tags or handling the case where no startDocument() call was received (other methods have such code).

Ignore my last comment... it was meant for xalanj-2276
I have reviewed and approve Brian's patch.[1]
[1] http://issues.apache.org/jira/secure/attachment/12323946/patch.txt

Fixed. The patch was applied to the latest development code.
Would the originator of this issue please verify that this issue is fixed in the 2.7.1 release by adding a comment to this issue, so that we can close it?
A lack of response by February 1, 2008 will be taken as consent that we can close this resolved issue.

Regards,
Brian Minchau
Would you please add to this bug report a simple XML input document that has the problem you
describe, and its corresponding output?
Thanks,
Brian Minchau