Commons Lang / LANG-955

Add methods for removing all invalid characters according to XML 1.0 and XML 1.1 in an input string to StringEscapeUtils

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1
    • Fix Version/s: 3.3
    • Component/s: lang.*
    • Environment: Ubuntu 13.10

    Description

      escapeXml lets non-text characters pass through into XML files:

      scala> org.apache.commons.lang3.StringEscapeUtils.escapeXml("\u0004").codePointAt(0)
      res4: Int = 4
      

      I would expect an exception instead: either from StringEscapeUtils (refusing to encode the character) or, preferably, from String.codePointAt, because escapeXml returned an empty string. \u0004 is not a valid character in XML 1.0, and there is no way to represent it in an XML 1.0 document, not even as a character reference.

      Wikipedia summarizes the characters that are not allowed in XML – even after escaping: http://en.wikipedia.org/wiki/Valid_characters_in_XML. The reason for disallowing them: XML is a text interchange format, and control characters are not text.

      If StringEscapeUtils.escapeXml lets invalid XML characters through, whether escaped or not, it generates invalid XML. Conforming XML parsers will refuse to read such files.
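
To illustrate the requested behaviour, here is a minimal sketch of a filter that drops every code point outside the XML 1.0 `Char` production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class name `XmlCharFilter` and both method names are hypothetical; this is not the Commons Lang implementation, only one way the idea could be expressed.

```java
// Hypothetical helper, not part of Commons Lang.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies the input, silently dropping code points that XML 1.0 forbids. */
    static String stripInvalidXml10(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt joins valid surrogate pairs; an unpaired surrogate
            // comes back as its own value (0xD800-0xDFFF) and is dropped below.
            int cp = input.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With a filter like this applied before escaping, `XmlCharFilter.stripInvalidXml10("\u0004")` yields an empty string, so a subsequent `codePointAt(0)` fails with a `StringIndexOutOfBoundsException` instead of returning 4. Whether to drop such characters silently or to throw is a design choice; the sketch drops them.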

            People

              britter Benedikt Ritter
              adamhooper Adam Hooper
              Votes: 0
              Watchers: 4