The PPTX issue manifests when a document is decomposed and searched for a string. For some reason, some whitespace and line breaks are deleted during text extraction, so adjacent words run together. If you try to match a word such as "Friday" that has ended up concatenated with another string (for example "otherFriday"), a simple string match fails. A regular expression match still works, however. This behavior was observed in 3 of 8 randomly selected PPTX files downloaded from the internet. Document identification seems to work fine, so the only users of the new POI engine who would be affected are those decomposing attachments and searching them for a simple string, and only for PowerPoint 2007 documents. As noted above, regular expression matching can be used as a workaround.
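To illustrate the symptom described above, here is a minimal, stdlib-only sketch. It assumes the simple string match behaves like a whole-word comparison on whitespace-separated tokens; that is a guess about the matcher, not something confirmed in this report. The class and method names (`WhitespaceLossDemo`, `containsWord`, `regexFind`) are hypothetical and are not part of POI.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class WhitespaceLossDemo {
    // Hypothetical whole-word matcher (an assumption about how the
    // "simple string" search works): split on whitespace, compare tokens.
    static boolean containsWord(String text, String word) {
        return Arrays.asList(text.split("\\s+")).contains(word);
    }

    // Unanchored regular expression search: finds the literal anywhere,
    // even when it is glued to a neighbouring word.
    static boolean regexFind(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        String intended = "other\nFriday";  // text as it appears on the slide
        String extracted = "otherFriday";   // text after the line break is lost

        System.out.println(containsWord(intended, "Friday"));
        System.out.println(containsWord(extracted, "Friday"));
        System.out.println(regexFind(extracted, "Friday"));
    }
}
```

Under that assumption, the word match succeeds on the intended text, fails on the extracted text, and the regular expression search still finds "Friday", which would be consistent with the workaround mentioned above.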
Created attachment 23143
PPTX file to be extracted

Please use this PPTX to extract the text. The spaces and carriage returns are removed.
Fixed in r766775. CTTextLineBreak elements were not properly processed, resulting in missing line breaks.

Yegor
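For readers who want a picture of this kind of bug, here is a rough sketch of text extraction that either skips or honours line-break elements. It is not the code from r766775 and does not use POI's API. `Element`, `Run`, `LineBreak` and `extract` are hypothetical stand-ins, with `LineBreak` playing the role of CTTextLineBreak.

```java
import java.util.Arrays;
import java.util.List;

class LineBreakSketch {
    // Hypothetical stand-ins for the children of a text paragraph;
    // these are NOT POI's real types.
    interface Element {}

    static final class Run implements Element {
        final String text;
        Run(String text) { this.text = text; }
    }

    // Plays the role of a line-break element such as CTTextLineBreak.
    static final class LineBreak implements Element {}

    // Concatenate paragraph text. When honourBreaks is false, line-break
    // elements are silently skipped, which glues neighbouring runs together.
    static String extract(List<Element> paragraph, boolean honourBreaks) {
        StringBuilder sb = new StringBuilder();
        for (Element e : paragraph) {
            if (e instanceof Run) {
                sb.append(((Run) e).text);
            } else if (e instanceof LineBreak && honourBreaks) {
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Element> p = Arrays.asList(
                new Run("other"), new LineBreak(), new Run("Friday"));
        System.out.println(extract(p, false));
        System.out.println(extract(p, true));
    }
}
```

With `honourBreaks` set to false the runs come out as "otherFriday"; with it set to true a newline separates them, which is the behaviour one would expect after the fix.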