Tika / TIKA-895

Empty title element makes Tika-generated HTML documents not open

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.3
    • Component/s: metadata
    • Labels:
    • Environment:

      Windows 7

      Description

      I tried to convert an empty docx file to an HTML file.
      Example: java -jar tika-app-1.1.jar -x example.docx > t.html

      The resulting HTML file can't be opened in Firefox, Internet Explorer, or Chrome.

      The main point is that <title/> seems to be forbidden by the HTML 4.01 specification (I could not find the corresponding statement for HTML5):

      http://www.w3.org/TR/html401/struct/global.html#h-7.4.2

      7.4.2 The TITLE element

      <!-- The TITLE element is not considered part of the flow of text.
             It should be displayed, for example as the page header or
             window title. Exactly one title is required per document.
          -->
      <!ELEMENT TITLE - - (#PCDATA) -(%head.misc;) -- document title -->
      <!ATTLIST TITLE %i18n>

      Start tag: required, End tag: required

      For reference, the same bug occurred with xls files:
      https://issues.apache.org/jira/browse/TIKA-725

      A simple solution would be to provide an empty title by default.

          Activity

          Konstantin Gribov added a comment -

          Still present in Tika 1.2. The fix was made in TIKA-725 (by Jukka Zitting).

          Ray Gauss II added a comment -

          Reopening to resolve as fixed rather than duplicate.

          Ray Gauss II added a comment -

          When a TransformerHandler is used, the actual writing of the final elements is delegated to an XML serializer such as ToHTMLStream, which extends ToStream.

          When ToStream.characters is called with zero length, it returns immediately without closing the start tag of the current element. ToStream.endElement then checks whether the start tag is still open to decide whether to close the element as <title/> or <title></title>.

          It seems the code brought over from the Xalan project to the JDK was locked down considerably during the transition. When using Xalan directly, an alternate XML serializer can be specified via XSLT or other means [1], but in the JDK that functionality seems to have been removed, since TransletOutputHandlerFactory.getSerializationHandler has ToHTMLStream hard-coded.

          Additionally, ToHTMLStream is declared as final and the majority of the classes which one would normally extend to use a different TransletOutputHandlerFactory are internal, so a proper solution would likely involve depending on xalan directly or duplicating a whole lot of code, neither of which is ideal.

          As a workaround, an ExpandedTitleContentHandler content handler decorator was added. It checks for the previous fix for this issue, a call to characters(new char[0], 0, 0) for the title element, and if present changes the length to 1, then catches the expected ArrayIndexOutOfBoundsException thrown by ToStream.characters.

          The result is that the title start tag is closed since the check for zero length passes and no character writing is attempted.
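
To make the mechanism above concrete, here is a minimal, JDK-only sketch. It is not Tika's actual ExpandedTitleContentHandler and not the real ToStream: `TitleExpandingHandlerSketch`, `ToySerializerSketch`, and `EmptyTitleSketchDemo` are hypothetical names, and the toy serializer only models the behaviour described in this comment (zero-length characters leave the start tag open; endElement self-closes an open start tag).

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical decorator illustrating the workaround; not Tika's real class. */
class TitleExpandingHandlerSketch extends XMLFilterImpl {
    private boolean inTitle = false;

    TitleExpandingHandlerSketch(ContentHandler delegate) {
        setContentHandler(delegate); // XMLFilterImpl forwards SAX events to this handler
    }

    private static boolean isTitle(String localName, String qName) {
        return "title".equalsIgnoreCase(localName) || "title".equalsIgnoreCase(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, qName, atts);
        if (isTitle(localName, qName)) inTitle = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, qName);
        if (isTitle(localName, qName)) inTitle = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (inTitle && length == 0) {
            try {
                // Claim a length of 1 so the serializer closes the start tag
                // before it tries (and fails) to read from the empty array.
                super.characters(new char[0], 0, 1);
            } catch (ArrayIndexOutOfBoundsException expected) {
                // Expected: there is nothing to write, but the start tag is now closed.
            }
        } else {
            super.characters(ch, start, length);
        }
    }
}

/** Toy model (an assumption) of the serializer behaviour described in the comment. */
class ToySerializerSketch extends DefaultHandler {
    final StringBuilder out = new StringBuilder();
    private boolean startTagOpen = false;

    private void closeStartTag() {
        if (startTagOpen) { out.append('>'); startTagOpen = false; }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        closeStartTag();
        out.append('<').append(qName);
        startTagOpen = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (length == 0) return;          // zero length: the start tag stays open
        closeStartTag();
        for (int i = start; i < start + length; i++) out.append(ch[i]);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (startTagOpen) { out.append("/>"); startTagOpen = false; }
        else out.append("</").append(qName).append('>');
    }
}

class EmptyTitleSketchDemo {
    /** Emits an empty title element, optionally through the decorator sketch. */
    static String render(boolean wrap) {
        ToySerializerSketch serializer = new ToySerializerSketch();
        ContentHandler target = wrap ? new TitleExpandingHandlerSketch(serializer) : serializer;
        try {
            target.startElement("", "title", "title", new AttributesImpl());
            target.characters(new char[0], 0, 0);
            target.endElement("", "title", "title");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return serializer.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(false)); // prints <title/>
        System.out.println(render(true));  // prints <title></title>
    }
}
```

The sketch relies on the exception being thrown only after the start tag has been closed, which is why the approach is tied to the serializer's internal ordering and may behave differently with other serializers.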

          TikaCLI was modified to wrap the transformer handler returned by SAXTransformerFactory for the html output method, so only handling of the title tag for HTML output will be affected by the change.

          In the event that this approach has adverse effects for those using XML serializers other than those present in the JDK, the change to TikaCLI can be reverted or made an option.

          Those calling Tika programmatically will need to wrap their transformer handlers in an ExpandedTitleContentHandler as well, for example:

              ...
              SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
              TransformerHandler handler = factory.newTransformerHandler();
              handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
              handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
              handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
              handler.setResult(new StreamResult(output));
              return new ExpandedTitleContentHandler(handler);
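
For anyone who wants to see what their own JDK serializer emits for an empty title, the following is a small, JDK-only check built on the same handler setup. The class name `EmptyTitleSerializationCheck`, the fixed "no" and "UTF-8" property values, and the StringWriter output are assumptions made for illustration; the ExpandedTitleContentHandler wrap is shown only as a comment because it requires tika-core on the classpath.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class EmptyTitleSerializationCheck {

    /** Serializes html/head/title with a zero-length characters call and returns the markup. */
    static String serializeEmptyTitle() {
        try {
            StringWriter output = new StringWriter();
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
            handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            handler.setResult(new StreamResult(output));

            // With tika-core available, wrap the handler as described in the comment:
            // ContentHandler target = new ExpandedTitleContentHandler(handler);
            ContentHandler target = handler;

            AttributesImpl noAttrs = new AttributesImpl();
            target.startDocument();
            target.startElement("", "html", "html", noAttrs);
            target.startElement("", "head", "head", noAttrs);
            target.startElement("", "title", "title", noAttrs);
            target.characters(new char[0], 0, 0); // the earlier fix: zero-length title text
            target.endElement("", "title", "title");
            target.endElement("", "head", "head");
            target.endElement("", "html", "html");
            target.endDocument();
            return output.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The exact form of the title element depends on the serializer in use.
        System.out.println(serializeEmptyTitle());
    }
}
```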
          

          Resolved in r1423538.

          [1] http://xml.apache.org/xalan-j/usagepatterns.html

          Chris A. Mattmann added a comment -

          Thanks Ray, appreciate it!!


            People

            • Assignee: Ray Gauss II
            • Reporter: Benoit MAGGI
            • Votes: 1
            • Watchers: 3
