Tika / TIKA-727

Improve the XHTML output of HSLFExtractor

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.10
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      The XHTML output of the HSLFExtractor parser is not proper XHTML: it only inserts the full text into a single P[aragraph] tag, including raw carriage returns that have no meaning in HTML. This behavior comes from the limited capabilities of the POI PowerPointExtractor.

      1. HSLFExtractor.java
        7 kB
        Pablo Queixalos
      2. HSLFExtractor.patch
        2 kB
        Pablo Queixalos

        Activity

        Pablo Queixalos created issue -
        Pablo Queixalos added a comment -

        Parser implementation based on what the POI PowerPointExtractor does.

        Pablo Queixalos made changes -
        Field Original Value New Value
        Attachment HSLFExtractor.java [ 12496089 ]
        Nick Burch added a comment -

        Thanks for this, applied with some tweaks in r1174056.

        Looking at the html, there are still some newlines coming through without BR tags between them, so maybe a few more tweaks are still needed?
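One way the remaining newlines could be handled is sketched below: split the text on line terminators and emit an empty br element between the pieces, instead of passing the raw newline to characters(). This is only an illustration under assumptions, not the code that was committed. The names LineBreakDemo, charactersWithBreaks and render are hypothetical, and a small recording handler stands in for the real XHTML handler so the events are easy to inspect.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class LineBreakDemo {

    /** Emits text as SAX events, replacing each line terminator with an empty br element. */
    static void charactersWithBreaks(ContentHandler handler, String text) throws SAXException {
        // Limit -1 keeps trailing empty strings, so a trailing newline still yields a br.
        String[] lines = text.split("\\r\\n|\\r|\\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                handler.startElement("", "br", "br", new AttributesImpl());
                handler.endElement("", "br", "br");
            }
            char[] chars = lines[i].toCharArray();
            handler.characters(chars, 0, chars.length);
        }
    }

    /** Hypothetical helper: records the events as a flat string for inspection. */
    static String render(String text) {
        StringBuilder out = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            charactersWithBreaks(recorder, text);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: line one<br></br>line two<br></br>line three
        System.out.println(render("line one\r\nline two\nline three"));
    }
}
```

The same helper would work against any ContentHandler, so it could be applied wherever extracted text is currently passed through in one piece.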

        Pablo Queixalos added a comment - - edited

        Great!

        The non-breaking-space entities inserted with the author-comment extraction are missing a semicolon:

        • HSLFExtractor.java:100 xhtml.characters( "&nbsp-&nbsp");
          should be:
        • HSLFExtractor.java:100 xhtml.characters( "&nbsp;-&nbsp;");

        The attribute name for the top-level DIV should be 'class', not 'style'.

        Sorry for the bad quality of my little contrib.

        MasterSheet data is present for each slide (not pretty). This behavior relates to the discussion in TIKA-712.

        Jukka Zitting added a comment - - edited

        Note that the XML serializer will automatically escape the character data, so a characters() event like that will result in "&amp;nbsp;" being serialized.

        In this case I'd simply use a normal space, as there doesn't seem to be any compelling reason why a non-breaking space is needed. If one really is needed, I'd use the Unicode NO-BREAK SPACE character \u00a0, though note that Character.isWhitespace() returns false for it, which can easily confuse text tokenizers.
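Both points can be checked with a small standalone sketch. It uses the JDK's identity TransformerHandler as a stand-in for the serializer (an assumption; the serializer Tika actually uses may differ in details), and the class and method names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class NbspEscapingDemo {

    /** Serializes a single p element whose text is passed through characters(). */
    static String serialize(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The literal '&' is escaped, so the output contains &amp;nbsp;-&amp;nbsp;
        System.out.println(serialize("&nbsp;-&nbsp;"));

        // U+00A0 is a space separator, but Java does not treat it as whitespace.
        System.out.println(Character.isWhitespace('\u00a0')); // false
        System.out.println(Character.isSpaceChar('\u00a0'));  // true
    }
}
```

A tokenizer that splits on Character.isWhitespace() would therefore treat text joined by \u00a0 as a single token, which is the confusion mentioned above.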

        Pablo Queixalos added a comment -

        +1 on Jukka's comment.

        Pablo Queixalos added a comment -

        "Looking at the html, there are still some newlines coming through without BR tags between them"

        FYI, the PDF parser has the same behavior.

        Pablo Queixalos added a comment - - edited

        Attachment added: HSLFExtractor.patch

        • Fixed the class attribute for the top-level div.
        • Fixed a bad getFooterText call.
        • Fixed the badly typed non-breaking spaces for comments.
        • Improved robustness against null comments and an NPE on slide.getShapes().
        Pablo Queixalos made changes -
        Attachment HSLFExtractor.patch [ 12496997 ]
        Nick Burch added a comment -

        Thanks for the patch, applied with a few tweaks in r1177313.

        If you come across any NPEs in getShapes, any chance you could open POI bugs for them? We shouldn't have them, so it would be good to fix them there.

        Pablo Queixalos added a comment -

        I just realized that the PPT file in question is broken, so the NPE on getShapes() may not be a POI issue.


          People

          • Assignee: Unassigned
          • Reporter: Pablo Queixalos