While upgrading our application from Tika 0.9 to Tika 1.2, a few PowerPoint 97-2003 (PPT) files which previously parsed correctly started failing with exceptions in NPOIFS. The root cause appears to be a difference in how BAT entries are read from XBAT blocks in POIFSFileSystem versus NPOIFSFileSystem. In POIFS, the header's getBATCount is used as a hard limit on the number of BATs which are read; in NPOIFS, XBATEntriesPerBlock entries are read for every XBAT, even if this causes more total BAT entries to be read than header.getBATCount. In some files, the extraneous BAT entries are all initialized to the same value, which is then detected as a possible cycle. The attached PPT file demonstrates this problem (it was found via a web-crawler search for test content, so I cannot grant a license to Apache to redistribute it). The attached patch makes NPOIFS behave similarly to POIFS here, and allows the file to parse without exception.
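To illustrate the difference described above, here is a minimal, self-contained sketch. It is not the POI source and not the attached patch: the class XbatLimitSketch, its methods, the tiny block sizes and the filler value 0 are all hypothetical, and the duplicate check is only a stand-in for the real cycle detection. It assumes the BAT offsets come first from the header and then from the XBAT blocks, and shows how capping at the header's BAT count keeps the trailing filler entries out of the list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and values are illustrative, not POI APIs.
class XbatLimitSketch {

    // NPOIFS-like behavior as described in the report: every entry of
    // every XBAT block is taken, regardless of the header's BAT count.
    static List<Integer> readUncapped(int[] headerBats, int[][] xbatBlocks) {
        List<Integer> bats = new ArrayList<>();
        for (int b : headerBats) {
            bats.add(b);
        }
        for (int[] xbat : xbatBlocks) {
            for (int b : xbat) {
                bats.add(b);
            }
        }
        return bats;
    }

    // POIFS-like behavior as described in the report: stop as soon as
    // batCount offsets have been collected.
    static List<Integer> readCapped(int[] headerBats, int[][] xbatBlocks, int batCount) {
        List<Integer> bats = new ArrayList<>();
        for (int b : headerBats) {
            if (bats.size() == batCount) return bats;
            bats.add(b);
        }
        for (int[] xbat : xbatBlocks) {
            for (int b : xbat) {
                if (bats.size() == batCount) return bats;
                bats.add(b);
            }
        }
        return bats;
    }

    // Stand-in for cycle detection: true if any block offset appears twice.
    static boolean hasRepeatedOffset(List<Integer> offsets) {
        Set<Integer> seen = new HashSet<>();
        for (int o : offsets) {
            if (!seen.add(o)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int batCount = 4;                  // what the header claims
        int[] headerBats = {10, 11};       // offsets stored in the header
        int[][] xbats = {{12, 13, 0, 0, 0}}; // trailing entries are filler

        System.out.println("uncapped repeats: "
                + hasRepeatedOffset(readUncapped(headerBats, xbats)));
        System.out.println("capped repeats:   "
                + hasRepeatedOffset(readCapped(headerBats, xbats, batCount)));
    }
}
```

In this toy input the uncapped read picks up three identical filler offsets, which the duplicate check flags, while the capped read stops after the four offsets the header declares.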
Created attachment 29315: patch fixing cycle detection in NPOI
Bugzilla isn't letting me upload the file; however, the file may be downloaded from http://www.slideshare.net/jbrenman/thirst.
Thanks for this. A slightly modified version was committed in r1442095. With that in place, I can now process that SlideShare file without problems.