TIKA-1755: Make ppt and pptx paragraph/div breaks more consistent

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed

    Description

      While working on kiwiwings's patch for the new handling of PPT/X, I found that our PPT and PPTX parsers differ widely in where they emit <p> and <div> breaks, especially now that we've applied the upgrades from TIKA-1707.

      I propose adding quite a few more <p> elements to the PPTX output to capture the sentence/bullet-level breaks, as we're now doing for PPT.
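
      One rough way to check how far the two outputs diverge is to count the <p> and <div> start tags in each parser's XHTML. The sketch below uses only the JDK's SAX parser; the class name BreakCounter and the two input fragments are hypothetical illustrations, not actual Tika output, and it assumes the XHTML is well-formed.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BreakCounter {

    /** Counts <p> and <div> start tags in a well-formed XHTML string. */
    public static Map<String, Integer> countBreaks(String xhtml) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("div", 0);
        counts.put("p", 0);

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The default factory is not namespace-aware, so qName is the plain tag name.
                if (counts.containsKey(qName)) {
                    counts.merge(qName, 1, Integer::sum);
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical fragments for illustration only, NOT real parser output.
        String pptLike = "<body><div class=\"slide\"><p>Title</p>"
                + "<p>First bullet</p><p>Second bullet</p></div></body>";
        String pptxLike = "<body><div class=\"slide\">"
                + "<p>Title First bullet Second bullet</p></div></body>";

        System.out.println("ppt:  " + countBreaks(pptLike));
        System.out.println("pptx: " + countBreaks(pptxLike));
    }
}
```

      Running the same count over the XHTML produced for a .ppt and a .pptx export of the same deck would give a quick, if coarse, measure of how consistent the breaks are before and after the patch.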

      There are a handful of other things that we could clean up (table handling) as well.

      Some of these changes may be relevant to this discussion. shaie, any input?

      Patch and example output to follow.

      Attachments

        1. TIKA-1755.patch
          21 kB
          Tim Allison

          People

            Assignee: Unassigned
            Reporter: Tim Allison (tallison)
            Votes: 0
            Watchers: 3
