DdlUtils / DDLUTILS-174

Not all "special characters" in content detected

    Details

      Description

      Besides the characters already detected in DataWriter.containsSpecialCharacters there is another "special character" that requires using base64 encoding: "<![CDATA[". Because content is written to XML within "<![CDATA[...]]>" it may not contain "<![CDATA[".
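
The proposed detection could look roughly like the following sketch. It is not the code of DataWriter.containsSpecialCharacters or of the attached patch.txt, neither of which is shown here; the class name, the method names, and the choice of UTF-8 are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of DdlUtils.
final class CdataMarkerSketch {
    static final String CDATA_START = "<![CDATA[";

    private CdataMarkerSketch() {}

    // True if the value contains the start marker of a CDATA section.
    static boolean containsCdataStart(String text) {
        return text.contains(CDATA_START);
    }

    // Encodes the value as base64 so that no markup characters remain.
    // UTF-8 is an assumption; the encoding DdlUtils uses is not stated in this issue.
    static String toBase64(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String value = "<test><![CDATA[some text]]></test>";
        if (containsCdataStart(value)) {
            System.out.println(toBase64(value));
        }
    }
}
```")) throw new AssertionError("marker not detected");
if (CdataMarkerSketch.containsCdataStart("<test>x</test>")) throw new AssertionError("false positive");
String decoded = new String(java.util.Base64.getDecoder().decode(CdataMarkerSketch.toBase64("a<b")), java.nio.charset.StandardCharsets.UTF_8);
if (!decoded.equals("a<b")) throw new AssertionError(decoded);
</test>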

      1. patch.txt
        1 kB
        Michael Lipp
      2. liferay-db-schema.xml
        4 kB
        Michael Lipp
      3. populate-liferay-db.xml
        15 kB
        Michael Lipp

        Activity

        Michael Lipp added a comment -

        Detect "<![CDATA[" as special character.

        Thomas Dudziak added a comment -

        Mhmm, "<![CDATA[" is not actually a valid value for an XML attribute or Text element, because the '<' needs to be escaped (&lt;). Hence, DdlUtils writes "&lt;![CDATA[" which is fine as far as XML is concerned.

        Michael Lipp added a comment -

        Here comes what happens in greater detail.

        I have a database that has XML data in varchars. When DdlUtils writes the data to its XML <data> file, it puts "<![CDATA[...]]>" around the content of varchar columns that is written as XML text and contains XML (I don't know whether columns with XML content are ever written as attributes; it doesn't happen with my database). In general this approach is fine. If, however, the XML contained in the column already includes "<![CDATA[...]]>" sections, this approach fails because you get nested CDATA sections.

        Your comment seems to assume that DdlUtils escapes all '<'s instead of using "<![CDATA[...]]>". This assumption is definitely wrong! I can send you the output file, but I think you believe me that I can tell the difference.

        Of course, protecting a column's XML content by escaping the '<'s instead of using "<![CDATA[...]]>" is a valid alternative to my approach of considering "<![CDATA[" a "special character". But one of the two must be implemented. As DdlUtils behaves currently, we get an invalid <data> file.

        Thomas Dudziak added a comment -

        As for the XML attributes, DdlUtils generates attributes only if the length of the textual representation of the attribute's value does not exceed 255 characters and it does not contain characters that are illegal in XML (e.g. \0).
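
The attribute rule stated above can be sketched as follows. This is only an illustration of the two conditions (length limit and legal characters), not the DdlUtils implementation; the class and method names are hypothetical, and the character ranges are those of the XML 1.0 Char production.

```java
// Hypothetical helper, not part of DdlUtils.
final class AttributeRuleSketch {
    static final int MAX_ATTRIBUTE_LENGTH = 255;

    private AttributeRuleSketch() {}

    // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // A value qualifies as an attribute only if it is short enough
    // and every character is legal in XML 1.0 (so "\0" disqualifies it).
    static boolean canWriteAsAttribute(String value) {
        if (value.length() > MAX_ATTRIBUTE_LENGTH) {
            return false;
        }
        return value.codePoints().allMatch(AttributeRuleSketch::isLegalXml10Char);
    }

    public static void main(String[] args) {
        System.out.println(canWriteAsAttribute("some text"));
        System.out.println(canWriteAsAttribute("bad\0value"));
    }
}
```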

        As for the <![CDATA[ ]]> section, assume we have an object read from the database with two attributes value1 and value2, both containing this XML snippet:

        <?xml version="1.0" encoding="ISO-8859-1"?><test><![CDATA[some text]]></test>

        When written by DdlUtils to a data XML file (and I forced DdlUtils to write value2 to a sub element rather than an attribute), this will look like this:

        <?xml version='1.0' encoding='UTF-8'?>
        <data>
        <test id="1" value1="&lt;?xml version=&quot;1.0&quot; encoding=&quot;ISO-8859-1&quot;?&gt;&lt;test&gt;&lt;![CDATA[some text]]&gt;&lt;/test&gt;">
        <value2><![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?><test><![CDATA[some text]]]]><![CDATA[></test>]]></value2>
        </test>
        </data>

        (which parses fine in a standards-compliant XML parser).

        The interesting part here is that an XML parser does not care about the beginning of the embedded CDATA part, but only about the end of it. And if you look at the above, then the two brackets of the end of the embedded CDATA section are contained in a different 'real' CDATA section than the '>' character. Hence, parsing poses no problem because the ]] and the > are not seen as belonging together and thus won't end a 'real' CDATA section prematurely.
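
The splitting described above can be sketched in a few lines of Java. This is an illustration of the technique only, not the DdlUtils or Woodstox code; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of DdlUtils or Woodstox.
final class CdataSplitSketch {
    private CdataSplitSketch() {}

    // Wraps the text in CDATA markup. Wherever the text itself contains "]]>",
    // the current section is closed after "]]" and a new one is opened before ">",
    // so the three characters never appear together inside one section.
    static String toCdataSections(String text) {
        String split = text.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + split + "]]>";
    }

    public static void main(String[] args) {
        String value = "<test><![CDATA[some text]]></test>";
        // prints <![CDATA[<test><![CDATA[some text]]]]><![CDATA[></test>]]>
        System.out.println(toCdataSections(value));
    }
}
```

An embedded "<![CDATA[" needs no treatment at all, because inside a CDATA section only "]]>" is recognized as markup.");
if (!out.equals("<![CDATA[<test><![CDATA[some text]]]]><![CDATA[></test>]]>")) throw new AssertionError(out);
String plain = CdataSplitSketch.toCdataSections("abc");
if (!plain.equals("<![CDATA[abc]]>")) throw new AssertionError(plain);
</test>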

        Now, if you have a case where this does not work, then that would be a bug, in which case please attach e.g. database schema and data export using for instance INSERT statements, to this issue and I'll fix it.

        Michael Lipp added a comment -

        Please find attached the output of databaseToDdl that shows the described error. It is from a HSQL database.

        How to reproduce: as the data is not valid XML (this is what all this is about) you have to edit the file in order to be able to restore it. Remove the "<![CDATA[" after "<CONTENT>" and the "]]>" before "</CONTENT>". Replace (between "<CONTENT>" and "</CONTENT>") all occurrences of '<' with "&lt;" and '>' with "&gt;". Then use ddlToDatabase to restore the database.

        Dump the restored database. You'll find that it produces the files attached, including the faulty nesting of CDATA-sections.

        Thomas Dudziak added a comment -

        After some testing I found the problem - it is actually a bug in the default XML stream writer in the JDK, which does not properly escape CDATA sequences. Woodstox (http://woodstox.codehaus.org/), which is used by the unit tests, does not have this problem, which is why I couldn't reproduce it in the tests. To fix the problem, simply add the Woodstox jar (wstx-asl-[version].jar, comes with DdlUtils in the lib folder) to your classpath; Woodstox should then be used instead of the default XML stream writer.
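
To see which StAX implementation is actually picked up from the classpath, a small check like the following can help. It is a sketch with a hypothetical class name; it only prints the class of the factory returned by the standard lookup. Woodstox classes live in the com.ctc.wstx package, so a name starting with that prefix would indicate that Woodstox is in use.

```java
import javax.xml.stream.XMLOutputFactory;

// Hypothetical diagnostic helper, not part of DdlUtils.
final class StaxImplementationCheck {
    private StaxImplementationCheck() {}

    // Returns the class name of the XMLOutputFactory found by the standard StAX lookup.
    static String outputFactoryClassName() {
        return XMLOutputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(outputFactoryClassName());
    }
}
```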

        Michael Lipp added a comment -

        Thanks. Are you sure that it is a bug in the JDK XML stream writer? Has it been filed? Might it also be wrong API usage?

        I'm insisting because this means that we have to replace a standard library from JDK with some 3rd-party-library. In order to do this, we have to convince our customers' QA that this is really necessary. The bug id under which this is filed for the JDK would help a lot.

        Thomas Dudziak added a comment -

        I'm quite sure because the API usage boils down to

        writeStartElement()
        writeCData()
        writeEndElement()

        (see http://svn.apache.org/viewvc/db/ddlutils/trunk/src/java/org/apache/ddlutils/io/DataWriter.java?view=markup in method write(SqlDynaBean).)

        It's not filed though AFAICS in Sun's bug database.
        Btw, if I'm not mistaken, only beginning with JDK6 a stax implementation is provided by the JDK, so it is a relatively new technology in the JDK anyways whereas Woodstox has been around the block for some time now. If you need some more arguments for using Woodstox, see the paragraph "Why use Woodstox of all available StAX implementations?" on the Woodstox homepage (http://woodstox.codehaus.org).
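
The call sequence named above can be tried in isolation with a sketch like this one. It is not the DataWriter code; the class and method names are hypothetical. How a "]]>" inside the content is handled depends on the StAX implementation found on the classpath, which is the point under discussion, so no particular output is claimed here.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of DdlUtils.
final class WriteCDataSketch {
    private WriteCDataSketch() {}

    // Writes <name>, one writeCData call with the content, and </name>,
    // using whatever StAX implementation the standard lookup returns.
    static String writeElementWithCData(String name, String content) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement(name);
            writer.writeCData(content);
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write element " + name, e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Content with an embedded CDATA section, as in the example from this issue.
        System.out.println(writeElementWithCData("value2", "<test><![CDATA[some text]]></test>"));
    }
}
```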

        Michael Lipp added a comment -

        I'm afraid this is exactly what I hinted at by "wrong API usage". Java specifications aren't always as clear as one might wish.

        Looking at XMLStreamWriter JavaDoc (from the JSR) you find: "The XMLStreamWriter does not perform well formedness checking on its input." Well, writing CDATA that contains "<![CDATA[...]]>" is writing data that is not wellformed, because CDATA sections may not contain "<![CDATA[...]]>".

        What woodstox does is (a) check wellformedness and (even worse) (b) modify input. By this it does not behave according to the specification. The only case in which XMLStreamWriter is allowed to modify input is clearly mentioned in the API: "However the writeCharacters method is required to escape & , < and > For attribute values the writeAttribute method will escape the above characters plus " to ensure that all character content and attribute values are well formed."

        So what woodstox does to strings written with writeCData might be convenient, but it does not follow the specification.
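
The escaping that the quoted Javadoc requires of writeCharacters can be observed with a short sketch. The class and method names are hypothetical. Written this way, the content ends up as escaped character data rather than as a CDATA section, so an embedded "<![CDATA[" cannot open a section.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of DdlUtils.
final class WriteCharactersSketch {
    private WriteCharactersSketch() {}

    // Writes <name>, the content via writeCharacters (which escapes markup), and </name>.
    static String writeElementWithText(String name, String content) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement(name);
            writer.writeCharacters(content);
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write element " + name, e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeElementWithText("value2", "<test><![CDATA[some text]]></test>"));
    }
}
```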

        I have filed this as woodstox bug WSTX-120.

        Thomas Dudziak added a comment -

        Fixed in the same way as Woodstox handles this problem (multiple CDATA sections), but also works with the default JDK stax implementation.


          People

          • Assignee:
            Thomas Dudziak
            Reporter:
            Michael Lipp