TIKA-1755

Make ppt and pptx paragraph/div breaks more consistent

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      While working on Andreas Beeker's patch for the new handling of PPT/X, I found that our PPT and PPTX parsers emit <p> and <div> breaks very differently, especially now that we've applied the upgrades from TIKA-1707.

      I propose adding quite a few more <p> elements to the PPTX output to capture sentence- and bullet-level breaks, as we now do for PPT.
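
      One rough way to see how far the two outputs diverge is to count the <p> and <div> elements in the XHTML each parser produces. The following is only a sketch under stated assumptions: the class name BreakCounter and both sample fragments are hypothetical, not actual parser output, and the XHTML is assumed to be available already as a well-formed string.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
public class BreakCounter {

    // Counts how often each element name occurs in a well-formed XHTML string.
    static Map<String, Integer> countElements(String xhtml) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                counts.merge(qName, 1, Integer::sum);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Made-up fragments for illustration only (assumption, not real output).
        String pptLike =
            "<body><div><p>First bullet</p><p>Second bullet</p></div></body>";
        String pptxLike =
            "<body><div>First bullet Second bullet</div></body>";
        System.out.println("ppt-like:  " + countElements(pptLike));
        System.out.println("pptx-like: " + countElements(pptxLike));
    }
}
```

      Running the same count over the real .ppt and .pptx output for equivalent slides would show where the <p> breaks are missing.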

      There are a handful of other things that we could clean up (table handling) as well.

      Some of these changes may be relevant to this discussion. Shai Erera, any input?

      Patch and example output to follow.

            People

            • Assignee:
              Unassigned
            • Reporter:
              Tim Allison (tallison@apache.org)