Tika extracts the text from the various text boxes and shapes on a slide and concatenates it into a single word, with no separator between the pieces.
This presentation contains a slide that has one text box containing the text 'TextBox 2', a shape containing the text 'invisible', and another shape containing the text 'ooooohhhhhdang'. The result of parsing is 'TextBox 2invisibleoooohhhhhdang'.
I think we just have to start/end a p element whenever we see an sf in the doc...
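To illustrate the idea, here is a minimal sketch, not the actual patch. It assumes the extractor reports text through SAX events, and it uses only the JDK's org.xml.sax classes. The names ShapeParagraphSketch, TextCollector, emitShapes and extract are hypothetical and are not Tika APIs. Wrapping each shape's text in its own p element gives the downstream handler a boundary at which to insert a break:

```java
import java.util.Arrays;
import java.util.List;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; none of these names come from Tika.
class ShapeParagraphSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Collects character data and appends a newline when a p element ends. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }
    }

    /** Emits each shape's text, optionally wrapped in its own p element. */
    static void emitShapes(List<String> shapeTexts, ContentHandler handler,
                           boolean wrapInParagraph) throws SAXException {
        for (String s : shapeTexts) {
            if (wrapInParagraph) {
                handler.startElement(XHTML, "p", "p", new AttributesImpl());
            }
            handler.characters(s.toCharArray(), 0, s.length());
            if (wrapInParagraph) {
                handler.endElement(XHTML, "p", "p");
            }
        }
    }

    /** Runs emitShapes against a TextCollector and returns the collected text. */
    static String extract(List<String> shapeTexts, boolean wrapInParagraph) {
        TextCollector collector = new TextCollector();
        try {
            emitShapes(shapeTexts, collector, wrapInParagraph);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        List<String> shapes = Arrays.asList("TextBox 2", "invisible", "ooooohhhhhdang");
        // Without p elements, the shape texts run together.
        System.out.println(extract(shapes, false));
        // With one p element per shape, each text ends up on its own line.
        System.out.print(extract(shapes, true));
    }
}
```

Without the wrapping, the collector sees nothing but consecutive characters() calls and has no way to tell where one shape ends and the next begins, which matches the run-together output reported above.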
Patch w/ test case & fix.
We are also seeing this behavior with multiple bullet points running together. I didn't want to open another ticket for this, so I'm just noting it here.
Thanks Erik, I confirmed that Tika trunk does that and that this patch fixes it; I'll add another test case for it....