Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Description
While working on kiwiwings's patch for the new handling of PPT/X, I found that our PPT and PPTX parsers differ considerably in where they emit <p> and <div> breaks, especially now that we've applied the upgrades from TIKA-1707.
I propose adding quite a few more <p> elements to the PPTX output to capture the sentence/bullet-level breaks, as we're now doing for PPT.
There are also a handful of other things we could clean up, such as table handling.
Some of these changes may be relevant to this discussion. shaie, any input?
Patch and example output to follow.
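In the meantime, here is a rough sketch of the idea behind the proposal, not the actual patch: each paragraph or bullet on a slide gets its own <p> inside the slide's <div>, rather than the text being run together. It is written in Python with the standard library's SAX writer purely for illustration; the function name `slide_to_xhtml` and the `slide-content` class are made up for this example and are not taken from the parsers.

```python
from io import StringIO
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def slide_to_xhtml(paragraphs):
    """Wrap each non-empty paragraph/bullet in its own <p> inside one <div>.

    Hypothetical helper for illustration only; the class name below is an
    assumption, not something the real parsers are known to emit.
    """
    out = StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startElement("div", AttributesImpl({"class": "slide-content"}))
    for text in paragraphs:
        if not text.strip():
            continue  # skip empty paragraphs so we don't emit bare <p></p>
        gen.startElement("p", AttributesImpl({}))
        gen.characters(text)  # special characters are escaped by the writer
        gen.endElement("p")
    gen.endElement("div")
    return out.getvalue()
```

For example, `slide_to_xhtml(["Title", "First bullet"])` produces one <div> containing two separate <p> elements, which is the sentence/bullet-level granularity described above.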