Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None
    • Flags:
      Patch

      Description

      In the not so far future, POI 3.13 Beta 2 will be available.
      This contains a quite big change to the Powerpoint modules XSLF/HSLF, but thankfully TIKA isn't much affected.
      Please try the patch on our trunk and post side-effects.

      As the work on the common_sl api hasn't been finished yet, there might be another patch for the next POI beta version.

      1. 075166.ppt
        1.13 MB
        Tim Allison
      2. common_sl.diff
        15 kB
        Andreas Beeker
      3. dont_trim_and_bullets.patch
        2 kB
        Andreas Beeker

        Issue Links

          Activity

          Hide
          kiwiwings Andreas Beeker added a comment -

          patch for POI 3.13 Beta 2 integration.
          (Although its origin is the common_sl change, I haven't moved the code to a common extractor for both formats (HSLF/XSLF) yet)

          Show
          kiwiwings Andreas Beeker added a comment - patch for POI 3.13 Beta 2 integration. (Although its origin is the common_sl change, I haven't moved the code to a common extractor for both formats (HSLF/XSLF) yet)
          Hide
          gagravarr Nick Burch added a comment -

          Looks good so far to me!

          One quick thing though - any chance you could review http://tika.apache.org/contribute.html#Code_Formatting and tweak your formatting settings accordingly? (I think your IDE isn't quite right for Tika, for imports at least)

          Show
          gagravarr Nick Burch added a comment - Looks good so far to me! One quick thing though - any chance you could review http://tika.apache.org/contribute.html#Code_Formatting and tweak your formatting settings accordingly? (I think your IDE isn't quite right for Tika, for imports at least)
          Hide
          kiwiwings Andreas Beeker added a comment -

          ok ... I've changed the import settings and executed organize imports.
          I couldn't validate the tests again, as the new grobid dependency somehow broke my build ... but apart of that, the rest is the same ...

          Show
          kiwiwings Andreas Beeker added a comment - ok ... I've changed the import settings and executed organize imports. I couldn't validate the tests again, as the new grobid dependency somehow broke my build ... but apart of that, the rest is the same ...
          Hide
          gagravarr Nick Burch added a comment -

          The build is hopefully working again now. If you could re-test, that'd be wonderful!

          Show
          gagravarr Nick Burch added a comment - The build is hopefully working again now. If you could re-test, that'd be wonderful!
          Hide
          kiwiwings Andreas Beeker added a comment -

          The affected test cases are ok now ... I haven't tried the full fledged tika test suite, as my JRE chokes on the 2GB heap settings, but tika-parsers seems to be ok with 1GB

          Show
          kiwiwings Andreas Beeker added a comment - The affected test cases are ok now ... I haven't tried the full fledged tika test suite, as my JRE chokes on the 2GB heap settings, but tika-parsers seems to be ok with 1GB
          Hide
          tallison@mitre.org Tim Allison added a comment -

          r1706079.

          Thank you, Andreas Beeker!

          Apologies to you and Nick Burch for not remembering that you had already fixed the upgrades with this patch when I opened the duplicate TIKA-1748.

          Show
          tallison@mitre.org Tim Allison added a comment - r1706079. Thank you, Andreas Beeker ! Apologies to you and Nick Burch for not remembering that you had already fixed the upgrades with this patch when I opened the duplicate TIKA-1748 .
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #860 (See https://builds.apache.org/job/tika-trunk-jdk1.7/860/)
          TIKA-1707: upgrade to POI 3.13 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1706079)

          • trunk/CHANGES.txt
          • trunk/tika-parsers/pom.xml
          • trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
          • trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
          • trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in tika-trunk-jdk1.7 #860 (See https://builds.apache.org/job/tika-trunk-jdk1.7/860/ ) TIKA-1707 : upgrade to POI 3.13 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1706079 ) trunk/CHANGES.txt trunk/tika-parsers/pom.xml trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          Andreas Beeker, we found a regression in spacing around differently formatted runs in ppt (here and TIKA-1778). Do you see any problems if we don't trim here:

          for (HSLFTextRun htr : htp.getTextRuns()) {
              String line = htr.getRawText();
                  if (line != null) {
                      for (String fragment : line.split("\\u000b")){
                             ....
                          xhtml.characters(fragment.trim());
          ...
          

          If we drop it, will we get spaces where we shouldn't?

          Perhaps trim the trailing carriage return?

          fragment.replaceFirst("\r$", " ")
          Show
          tallison@mitre.org Tim Allison added a comment - - edited Andreas Beeker , we found a regression in spacing around differently formatted runs in ppt ( here and TIKA-1778 ). Do you see any problems if we don't trim here: for (HSLFTextRun htr : htp.getTextRuns()) { String line = htr.getRawText(); if (line != null) { for (String fragment : line.split("\\u000b")){ .... xhtml.characters(fragment.trim()); ... If we drop it, will we get spaces where we shouldn't? Perhaps trim the trailing carriage return? fragment.replaceFirst("\r$", " ")
          Hide
          kiwiwings Andreas Beeker added a comment -

          I would replace it with the empty string and use the regex escape for the line break

              fragment.replaceFirst("\\r$", "")
          

          Apart of that, I've added a patch for bullet lists.
          Currently HSLF always returns false for super/subscript ... I need to change this in POI.

          Please comment, if it makes sense to add further markup information.

          Show
          kiwiwings Andreas Beeker added a comment - I would replace it with the empty string and use the regex escape for the line break fragment.replaceFirst( "\\r$" , "") Apart of that, I've added a patch for bullet lists. Currently HSLF always returns false for super/subscript ... I need to change this in POI. Please comment, if it makes sense to add further markup information.
          Hide
          kiwiwings Andreas Beeker added a comment -

          Patch for trim-replacement and bullet lists

          Show
          kiwiwings Andreas Beeker added a comment - Patch for trim-replacement and bullet lists
          Hide
          tallison@mitre.org Tim Allison added a comment -

          Thank you! That fixed the vast majority of content diffs. There are still 6 ppts where we aren't adding a break between lines and we're getting improper concatenation of terms across lines. Should we move boolean isFirst = true; above for (HSLFTextRun htr : textRuns) {?

          Show
          tallison@mitre.org Tim Allison added a comment - Thank you! That fixed the vast majority of content diffs. There are still 6 ppts where we aren't adding a break between lines and we're getting improper concatenation of terms across lines. Should we move boolean isFirst = true; above for (HSLFTextRun htr : textRuns) { ?
          Hide
          tallison@mitre.org Tim Allison added a comment - - edited

          Example file...

          Jeff Heimerman is one line and Senior ... is another line, but we aren't currently adding a <br/>

          Show
          tallison@mitre.org Tim Allison added a comment - - edited Example file... Jeff Heimerman is one line and Senior ... is another line , but we aren't currently adding a <br/>
          Hide
          tallison@mitre.org Tim Allison added a comment -

          That was a bad idea.

          The issue was that the first run ended with \u000b, and the split was hiding that "paragraph break" before the next run.

          So, how about adding the if line.endsWith("\u000b"):

                             if (line != null) {
                                  boolean isfirst = true;
                                  for (String fragment : line.split("\\u000b")) {
                                      if (!isfirst) {
                                          xhtml.startElement("br");
                                          xhtml.endElement("br");
                                      }
                                      isfirst = false;
                                      xhtml.characters(removePBreak(fragment));
                                  }
                                  if (line.endsWith("\u000b")) {
                                      xhtml.startElement("br");
                                      xhtml.endElement("br");
                                  }
                              }
          
          Show
          tallison@mitre.org Tim Allison added a comment - That was a bad idea. The issue was that the first run ended with \u000b, and the split was hiding that "paragraph break" before the next run. So, how about adding the if line.endsWith("\u000b") : if (line != null) { boolean isfirst = true; for (String fragment : line.split("\\u000b")) { if (!isfirst) { xhtml.startElement("br"); xhtml.endElement("br"); } isfirst = false; xhtml.characters(removePBreak(fragment)); } if (line.endsWith("\u000b")) { xhtml.startElement("br"); xhtml.endElement("br"); } }

            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              kiwiwings Andreas Beeker
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development