Tika
  1. Tika
  2. TIKA-651

Unescaped attribute value generated

    Details

    • Type: Bug Bug
    • Status: Reopened
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      I've converted a word document that contains hyperlinks with a complex query component. The & character is not escaped and mozilla complains about that when I write out the XHTML via a content handler that I wrote.

      It's not clear to me whether or not my contenthandler should assume attributes are properly escaped or not.

        Activity

        Hide
        Nick Burch added a comment -

        Isn't it up to the ContentHandler to do any escaping that's required? See http://permalink.gmane.org/gmane.comp.apache.tika.user/590 for the last time this came up

        Show
        Nick Burch added a comment - Isn't it up to the ContentHandler to do any escaping that's required? See http://permalink.gmane.org/gmane.comp.apache.tika.user/590 for the last time this came up
        Hide
        Uwe Schindler added a comment -

        Yes, as per SAX spec, the characters() event gets unescaped text (also startElement()'s attributes). So the code inside the content handler that writes out the text must escape it.

        If someone is interested, I have an xhtml conform content handler (that also makes HTML4-compatible XHTML, normal serializers just produce XML or HTML4), that is based off XALAN/XERCES serializer.jar, but adds some filtering, so the outputted XHTML has correct block tags, has space before "/>", and e.g. always separates start/end tag of some elements like <script></script> even if they are empty (because only attributes set). It also makes style/script always use CDATA with a faked comment to make this even compatible with HTML4, where SCRIPT/STYLE is defined to be CDATA by default (adding fake comments).

        You only need to add a proper DOCTYPE currently (its not added to this handler for our internal purposes, because we also serialize fragments usig that class, so it omits doctypes).

        I attached this document handler, maybe it is of some use for somebody. It needs serializer.jar from XERCES/XALAN to compile and work.

        This is currently as far as I know the only XHTML serializer that produces XHTML that is in generall needed to make XHTML documents behave correctly even with browsers that only support HTML4. It passes both HTML4 and XHTML1 validators (of course the element data needs to be correct).

        Show
        Uwe Schindler added a comment - Yes, as per SAX spec, the characters() event gets unescaped text (also startElement()'s attributes). So the code inside the content handler that writes out the text must escape it. If someone is interested, I have an xhtml conform content handler (that also makes HTML4-compatible XHTML, normal serializers just produce XML or HTML4), that is based off XALAN/XERCES serializer.jar, but adds some filtering, so the outputted XHTML has correct block tags, has space before "/>", and e.g. always separates start/end tag of some elements like <script></script> even if they are empty (because only attributes set). It also makes style/script always use CDATA with a faked comment to make this even compatible with HTML4, where SCRIPT/STYLE is defined to be CDATA by default (adding fake comments). You only need to add a proper DOCTYPE currently (its not added to this handler for our internal purposes, because we also serialize fragments usig that class, so it omits doctypes). I attached this document handler, maybe it is of some use for somebody. It needs serializer.jar from XERCES/XALAN to compile and work. This is currently as far as I know the only XHTML serializer that produces XHTML that is in generall needed to make XHTML documents behave correctly even with browsers that only support HTML4. It passes both HTML4 and XHTML1 validators (of course the element data needs to be correct).
        Hide
        Uwe Schindler added a comment -

        One addition: This content handler can also be perfectly used to make XSL Transformations outputting XHTML that is compatible according to above information:

        Transformer trans = ...; Source src = ...; Writer out = ...;
        trans.transform(src, new SAXResult(new XHTMLContentHandler(out)));
        

        In general this class may be best placed in XERCES/XALAN's serializers, but as XML spec dont need a separate output type of XHTML, I assume the apache XML project would not accept this for their serializers (I have not yet tried, as my relation to those projects is not very close). Maybe I give it a try.

        Maybe you could add this class in the utils of TIKA (as TIKA is based on XHTML)?

        Show
        Uwe Schindler added a comment - One addition: This content handler can also be perfectly used to make XSL Transformations outputting XHTML that is compatible according to above information: Transformer trans = ...; Source src = ...; Writer out = ...; trans.transform(src, new SAXResult( new XHTMLContentHandler(out))); In general this class may be best placed in XERCES/XALAN's serializers, but as XML spec dont need a separate output type of XHTML, I assume the apache XML project would not accept this for their serializers (I have not yet tried, as my relation to those projects is not very close). Maybe I give it a try. Maybe you could add this class in the utils of TIKA (as TIKA is based on XHTML)?
        Hide
        Chris A. Mattmann added a comment -

        Thanks Uwe!

        Show
        Chris A. Mattmann added a comment - Thanks Uwe!
        Hide
        Raimund Merkert added a comment -

        I agree (and it's actually quite obvious).
        As far as I'm concerned, the issue is resolved. It would be nice to have a full XHTML content handler, because that's why I wrote mine.

        Show
        Raimund Merkert added a comment - I agree (and it's actually quite obvious). As far as I'm concerned, the issue is resolved. It would be nice to have a full XHTML content handler, because that's why I wrote mine.
        Hide
        Uwe Schindler added a comment -

        This is why I added the one I use quite everywhere for producing XHTML. Maybe we can add it to either serializer.jar (via Apache XML) or TIKA.

        Show
        Uwe Schindler added a comment - This is why I added the one I use quite everywhere for producing XHTML. Maybe we can add it to either serializer.jar (via Apache XML) or TIKA.
        Hide
        Raimund Merkert added a comment -

        Great, works nicely.

        I added some code to work around an issue with the img tag and now validates correctly with http://validator.w3.org/check.

         
        		ContentHandler serializer = new XHTMLSerializer(new OutputStreamWriter(
        				xmlResult)) {
        
        			{
        				super.setDoctype(
        						"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
        						"-//W3C//DTD XHTML 1.0 Transitional//EN");
        			}
        
        			@Override
        			public void startElement(String namespaceURI, String localName,
        					String name, Attributes atts) throws SAXException {
        				super.startElement(namespaceURI, localName, name, atts);
        				if ("img".equals(localName)
        						&& "http://www.w3.org/1999/xhtml".equals(namespaceURI)) {
        					if (atts.getValue("alt") == null) {
        						addAttribute("alt", "");
        					}
        				}
        			}
        		};
        
        Show
        Raimund Merkert added a comment - Great, works nicely. I added some code to work around an issue with the img tag and now validates correctly with http://validator.w3.org/check . ContentHandler serializer = new XHTMLSerializer(new OutputStreamWriter( xmlResult)) { { super.setDoctype( "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd", "-//W3C//DTD XHTML 1.0 Transitional//EN"); } @Override public void startElement(String namespaceURI, String localName, String name, Attributes atts) throws SAXException { super.startElement(namespaceURI, localName, name, atts); if ("img".equals(localName) && "http://www.w3.org/1999/xhtml".equals(namespaceURI)) { if (atts.getValue("alt") == null) { addAttribute("alt", ""); } } } };
        Hide
        Jukka Zitting added a comment -

        Sounds like this would be a better match for Xerces/Xalan instead of Tika, especially since it has a dependency on them. Resolve as Won't Fix?

        Show
        Jukka Zitting added a comment - Sounds like this would be a better match for Xerces/Xalan instead of Tika, especially since it has a dependency on them. Resolve as Won't Fix?
        Hide
        Uwe Schindler added a comment -

        I have no preference. I just provide the code here for anybody interested. But indeed, it would be fine to have a XHTML serializer in Apache XML's serializer.jar

        Show
        Uwe Schindler added a comment - I have no preference. I just provide the code here for anybody interested. But indeed, it would be fine to have a XHTML serializer in Apache XML's serializer.jar
        Hide
        Jukka Zitting added a comment -

        Resolving as Won't Fix as there's no obvious place where this would fit well inside Tika.

        We can leave the code here for people to find and use, or if someone has the energy it can be pushed to Xerces/Xalan or perhaps somewhere in Apache Commons (there's a dormant Commons XML sandbox at http://svn.apache.org/repos/asf/commons/sandbox/xml/ that I started in 2009 for things like this).

        Show
        Jukka Zitting added a comment - Resolving as Won't Fix as there's no obvious place where this would fit well inside Tika. We can leave the code here for people to find and use, or if someone has the energy it can be pushed to Xerces/Xalan or perhaps somewhere in Apache Commons (there's a dormant Commons XML sandbox at http://svn.apache.org/repos/asf/commons/sandbox/xml/ that I started in 2009 for things like this).
        Hide
        Jukka Zitting added a comment -

        Reopening based on discussion in TIKA-692. Could there be a way to implement this without the dependency on Xerces? For example as a decorator for standard ContentHandler.

        Show
        Jukka Zitting added a comment - Reopening based on discussion in TIKA-692 . Could there be a way to implement this without the dependency on Xerces? For example as a decorator for standard ContentHandler.
        Hide
        Uwe Schindler added a comment -

        ...or XALAN

        It's hard to implement without that dependency, as you would need to implement a full XML serializer without that dep. The trick here is to reuse XALAN/XERCES XML serializer for XHTML, but modify it a bit to produce XHTML that is also valid HTML4.

        Show
        Uwe Schindler added a comment - ...or XALAN It's hard to implement without that dependency, as you would need to implement a full XML serializer without that dep. The trick here is to reuse XALAN/XERCES XML serializer for XHTML, but modify it a bit to produce XHTML that is also valid HTML4.
        Hide
        Michael McCandless added a comment -

        Is it so bad to add a dependency on Xerce's/Xalan's serializer.jar?
        It's smallish (~272 KB).

        If that's OK then I can add indenting to the current patch here, and
        we can fix TikaCLI to use this for bug-free XHTML serialization. If
        we take that approach then I think we should undo the other commits on
        TIKA-692, except for the added test cases., since we need not futz
        with the document's style formatting, add whitespace, turn off
        pretty-printing in TikaCLI, etc.?

        Show
        Michael McCandless added a comment - Is it so bad to add a dependency on Xerce's/Xalan's serializer.jar? It's smallish (~272 KB). If that's OK then I can add indenting to the current patch here, and we can fix TikaCLI to use this for bug-free XHTML serialization. If we take that approach then I think we should undo the other commits on TIKA-692 , except for the added test cases., since we need not futz with the document's style formatting, add whitespace, turn off pretty-printing in TikaCLI, etc.?
        Hide
        Jukka Zitting added a comment -

        Is it so bad to add a dependency on Xerce's/Xalan's serializer.jar?

        Yes. I've had numerous battles with XML processing gone haywire in systems that have accidentally pulled in a wrong versions of the XML processing libraries.

        BTW, serializing SAX events to XML, XHTML or HTML4 streams shouldn't be that difficult to implement directly. We could even copy relevant parts of the code from Xalan.

        Show
        Jukka Zitting added a comment - Is it so bad to add a dependency on Xerce's/Xalan's serializer.jar? Yes. I've had numerous battles with XML processing gone haywire in systems that have accidentally pulled in a wrong versions of the XML processing libraries. BTW, serializing SAX events to XML, XHTML or HTML4 streams shouldn't be that difficult to implement directly. We could even copy relevant parts of the code from Xalan.
        Hide
        Michael McCandless added a comment -

        Yes. I've had numerous battles with XML processing gone haywire in systems that have accidentally pulled in a wrong versions of the XML processing libraries.

        OK, that sounds bad – let's not add the dependency.

        BTW, serializing SAX events to XML, XHTML or HTML4 streams shouldn't be that difficult to implement directly. We could even copy relevant parts of the code from Xalan.

        That (poaching/cherry-picking what we need from Xalan) sounds like a reasonable approach here...

        Show
        Michael McCandless added a comment - Yes. I've had numerous battles with XML processing gone haywire in systems that have accidentally pulled in a wrong versions of the XML processing libraries. OK, that sounds bad – let's not add the dependency. BTW, serializing SAX events to XML, XHTML or HTML4 streams shouldn't be that difficult to implement directly. We could even copy relevant parts of the code from Xalan. That (poaching/cherry-picking what we need from Xalan) sounds like a reasonable approach here...

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Raimund Merkert
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development