TIKA-1755

Make ppt and pptx paragraph/div breaks more consistent

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      In working on Andreas Beeker's patch for the new handling of PPT/X, I found that our PPT/PPTX parsers behave very differently with <p> and <div> breaks, especially now that we've applied the upgrades from TIKA-1707.

      I propose adding quite a few more <p> elements to capture the sentence- and bullet-level breaks in PPTX, as we're now doing for PPT.
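
      As a rough illustration of the proposal, here is a minimal sketch that emits one <p> per sentence/bullet-level paragraph inside a slide-content div. It uses only the JDK's SAX and TrAX classes rather than Tika's own handler classes, and the class and method names (ParagraphEmitter, emitParagraphs, render) are hypothetical, not taken from the patch.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: wraps each paragraph in its own <p>.
final class ParagraphEmitter {

    // Writes <div class="slide-content"> with one <p> child per paragraph.
    static void emitParagraphs(ContentHandler handler, List<String> paragraphs)
            throws SAXException {
        AttributesImpl divAttrs = new AttributesImpl();
        divAttrs.addAttribute("", "class", "class", "CDATA", "slide-content");
        handler.startElement("", "div", "div", divAttrs);
        for (String text : paragraphs) {
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "p", "p");
        }
        handler.endElement("", "div", "div");
    }

    // Serializes the SAX events to a string using the JDK identity transformer.
    static String render(List<String> paragraphs) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));
            handler.startDocument();
            emitParagraphs(handler, paragraphs);
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize paragraphs", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(Arrays.asList("Here is a list:", "Bullet 1", "Bullet 2")));
    }
}
```

      In the real parsers the paragraph strings would come from the slide model; here they are passed in as a plain list so the sketch stays self-contained.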

      There are a handful of other things that we could clean up (table handling) as well.

      Some of these changes may be relevant to this discussion. Shai Erera, any input?

      Patch and example output to follow.

      1. TIKA-1755.patch
        21 kB
        Tim Allison

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Current patch gets us this with PPTX:

        <body><div class="slide-content"><table><tr>	<td>Row 1 Col 1</td>	<td>Row 1 Col 2</td>	<td>Row 1 Col 3</td></tr>
        <tr>	<td>Row 2 Col 1</td>	<td>Row 2 Col 2</td>	<td>Row 2 Col 3</td></tr>
        </table>
        <p>Here is a text box</p>
        <p>Footnote appears here[1]</p>
        <p>Bold italic underline superscript subscript</p>
        <p>Here is a list:</p>
        <p>Bullet 1</p>
        <p>Bullet 2</p>
        <p>Bullet 3</p>
        <p>Here is a numbered list:</p>
        <p>Number bullet 1</p>
        <p>Number bullet 2</p>
        <p>Number bullet 3</p>
        <p> Keyword1 Keyword2</p>
        <p>This is a hyperlink</p>
        <p> Subject is here</p>
        <p>Suddenly some Japanese text:</p>
        <p>????????????</p>
        <p>?????</p>
        <p>And then some Gothic text:</p>
        <p>??????</p>
        <p>Here is a citation:</p>
        <p>(Kramer)</p>
        <p>Figure 1 This is a caption for Figure 1</p>
        <p>
        </p>
        <p>Row 1 column 1</p>
        <p>Row 2 column 1</p>
        <p>Row 1 column 2</p>
        <p>Row 2 column 2</p>
        <p>
        </p>
        <p>
        </p>
        <p>[1] This is a footnote.</p>
        </div>
        <div class="slide-master-content" />
        <div class="slide-notes"><p>1</p>
        <p>This is the footer text.</p>
        <p>This is the header text.</p>
        </div>
        <div class="embedded" id="/docProps/thumbnail.jpeg" /></body></html>
        

        and this for PPT

        <body><div class="slideShow"><div class="slide"><div class="slide-master-content" />
        <div class="slide-content"><p />
        <p />
        <p />
        <p>Here is a text box</p>
        <p />
        <p>Footnote appears here[1]</p>
        <p>Bolditalicunderlinesuperscriptsubscript</p>
        <p>Here is a list:</p>
        <p>Bullet 1</p>
        <p>Bullet 2</p>
        <p>Bullet 3</p>
        <p>Here is a numbered list:</p>
        <p>Number bullet 1</p>
        <p>Number bullet 2</p>
        <p>Number bullet 3</p>
        <p>Keyword1 Keyword2</p>
        <p>This is a hyperlink</p>
        <p>Subject is here</p>
        <p>Suddenly some Japanese text:</p>
        <p>????????????</p>
        <p>?????</p>
        <p>And then some Gothic text:</p>
        <p>??????</p>
        <p>Here is a citation:</p>
        <p>(Kramer)</p>
        <p>Figure 1 This is a caption for Figure 1</p>
        <p />
        <p>Row 1 column 1</p>
        <p>Row 2 column 1</p>
        <p>Row 1 column 2</p>
        <p>Row 2 column 2</p>
        <p />
        <p />
        <p />
        <p>[1]This is a footnote.</p>
        </div>
        <table><tr>	<td>Row 1 Col 1</td>	<td>Row 1 Col 2</td>	<td>Row 1 Col 3</td></tr>
        <tr>	<td>Row 2 Col 1</td>	<td>Row 2 Col 2</td>	<td>Row 2 Col 3</td></tr>
        </table>
        </div>
        </div>
        <div class="slide-notes"><p />
        <p>*</p>
        <p>This is the footer text.</p>
        <p>This is the header text.</p>
        </div>
        </body></html>
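
        One way to check how far the two outputs still diverge is to compare the text of their <p> elements position by position. The following is a minimal sketch under stated assumptions: the input must be a well-formed XHTML fragment (the excerpts above would need their opening <html> tag restored), empty paragraphs are ignored, and the class and method names (ParagraphDiff, paragraphs, differences) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: compares <p> text of two outputs.
final class ParagraphDiff {

    // Returns the trimmed, non-empty text of every <p> in a well-formed fragment.
    static List<String> paragraphs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Reports each position at which the two paragraph lists disagree.
    static List<String> differences(List<String> pptx, List<String> ppt) {
        List<String> diffs = new ArrayList<>();
        int max = Math.max(pptx.size(), ppt.size());
        for (int i = 0; i < max; i++) {
            String left = i < pptx.size() ? pptx.get(i) : "(missing)";
            String right = i < ppt.size() ? ppt.get(i) : "(missing)";
            if (!left.equals(right)) {
                diffs.add(i + ": pptx='" + left + "' ppt='" + right + "'");
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        String pptx = "<div><p> Keyword1 Keyword2</p><p>[1] This is a footnote.</p></div>";
        String ppt = "<div><p>Keyword1 Keyword2</p><p>[1]This is a footnote.</p></div>";
        for (String diff : differences(paragraphs(pptx), paragraphs(ppt))) {
            System.out.println(diff);
        }
    }
}
```

        Trimming hides the leading-space difference in lines such as "Keyword1 Keyword2" but still surfaces internal differences like "[1] This" versus "[1]This".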
        
        tallison@mitre.org Tim Allison added a comment -

        Initial patch

        kiwiwings Andreas Beeker added a comment -

        I think the goal would be to modify common sl in such a way that only one Tika parser class is necessary, using SlideShowFactory and producing the same results for PPT/X.
        I already know of a few drawbacks of the current implementation:

        • line breaks are part of the hslf text runs whereas in xslf these are explicit tokens
        • tables are group shapes in hslf, but not in xslf ... but I guess this doesn't matter for tika
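
        The first point can be illustrated with a minimal sketch that normalizes both representations to the same paragraph list. It does not use the POI API: the HSLF side is modeled as a plain string with embedded break characters (assumed here to be carriage return, line feed, or vertical tab 0x0B), the XSLF side as a token list with a hypothetical BREAK sentinel, and empty paragraphs are dropped, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not the POI API: two ways of encoding line breaks.
final class BreakNormalizer {

    // Hypothetical sentinel standing in for an explicit break token.
    static final String BREAK = "<BREAK>";

    // HSLF-style input: breaks are characters embedded in the run text.
    // Assumption: CR, LF and vertical tab (0x0B) all count as breaks.
    static List<String> fromRun(String run) {
        List<String> out = new ArrayList<>();
        for (String part : run.split("[\\r\\n\\u000B]")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    // XSLF-style input: breaks are separate tokens between text tokens.
    static List<String> fromTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : tokens) {
            if (BREAK.equals(token)) {
                flush(out, current);
            } else {
                current.append(token);
            }
        }
        flush(out, current);
        return out;
    }

    // Adds the buffered text as one paragraph (if non-empty) and resets the buffer.
    private static void flush(List<String> out, StringBuilder current) {
        String trimmed = current.toString().trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(fromRun("Bullet 1\rBullet 2\rBullet 3"));
        System.out.println(fromTokens(
                Arrays.asList("Bullet 1", BREAK, "Bullet 2", BREAK, "Bullet 3")));
    }
}
```

        Once both sides produce the same list, a single downstream routine can wrap each entry in <p> regardless of the source format.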

        Currently my main goal for POI is to minimize our critical Sonar issues. If this Tika issue is important to you, drop me a line and I'll try to adapt this for POI 3.14-beta1.

        tallison@mitre.org Tim Allison added a comment -

        Yes, you've got plenty of bigger fish to fry, and the common sl contribution is huge! This issue is small potatoes: what can we do now within Tika to make our representation more equivalent between the old and the new without upsetting current users?

        tallison@mitre.org Tim Allison added a comment -

        r1707432

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #866 (See https://builds.apache.org/job/tika-trunk-jdk1.7/866/)
        TIKA-1755 make div and other formatting more consistent btwn PPT and PPTX (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1707432)

        • trunk/CHANGES.txt
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
        • trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
        • trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • trunk/tika-parsers/src/test/resources/test-documents/testPPT_comment.ppt
        • trunk/tika-parsers/src/test/resources/test-documents/testPPT_comment.pptx

          People

          • Assignee:
            Unassigned
            Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0
            Watchers:
            3
