Created attachment 26160: svn diff output

This patch proposes a SimpleExtractor and an XSSFSimpleWorkbook so that Tika can parse XLSX spreadsheets more efficiently (SAX-based parsing). This is related to TIKA-521 (https://issues.apache.org/jira/browse/TIKA-521). Test cases will follow once the proposed approach is approved.
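For readers unfamiliar with the SAX-based approach mentioned above, here is a minimal sketch using only the JDK's SAX parser. It is not the attached patch: the class name `SaxSheetSketch` and the method `extractText` are hypothetical, and it assumes a simplified sheet part that holds values inline in `<v>` or `<t>` elements. Shared-string lookup, styles and number formatting are deliberately left out.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of POI or Tika.
public class SaxSheetSketch {

    /** Streams through sheet XML and returns cell values, tab-separated, one line per row. */
    public static String extractText(String sheetXml) throws Exception {
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inValue;     // true while inside <v> or <t>
            private boolean firstInRow;  // true until the first <c> of a row is seen

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("row")) {
                    firstInRow = true;
                } else if (qName.equals("c")) {
                    if (!firstInRow) {
                        out.append('\t');
                    }
                    firstInRow = false;
                } else if (qName.equals("v") || qName.equals("t")) {
                    inValue = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("row")) {
                    out.append('\n');
                } else if (qName.equals("v") || qName.equals("t")) {
                    inValue = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only text inside a value element is kept; nothing else is buffered.
                if (inValue) {
                    out.append(ch, start, length);
                }
            }
        };

        // The default factory is not namespace-aware, so qName carries the plain element name.
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(sheetXml)), handler);
        return out.toString();
    }
}
```

The point of the sketch is that the parser only ever holds the current element in memory, which is why a streaming approach scales better on large sheets than building a full in-memory model.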
Created attachment 26161: New classes
I've done some refactoring of XSSFEventBasedExcelExtractor in r1036968, which should help on the Tika side when it comes to outputting the values as XHTML.

Next, I'll need to expand your XSSFSimpleWorkbook to cover all the different file parts we might need in order to replicate the functionality in XSSFExcelExtractorDecorator (this may need some more POI refactoring as well as new code).

Finally, we'd need to go to the Tika side and update XSSFExcelExtractorDecorator to use the new simple workbook, and implement a SheetContentsHandler that generates the XHTML events.
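As a rough sketch of the last step, the following shows a row/cell callback handler that turns sheet events into XHTML. Everything here is an assumption for illustration: `SheetCallbacks` is a hypothetical stand-in for the SheetContentsHandler interface (whose actual method signatures are not given in this thread), and the handler appends markup to a string, whereas the Tika version would presumably emit SAX events to its XHTML content handler.

```java
// Hypothetical illustration class, not part of POI or Tika.
public class XhtmlSheetSketch {

    /** Hypothetical callback interface: one call per row start, row end and cell. */
    public interface SheetCallbacks {
        void startRow(int rowNum);
        void endRow();
        void cell(String cellReference, String formattedValue);
    }

    /** Builds an XHTML table from the row and cell callbacks. */
    public static class XhtmlTableHandler implements SheetCallbacks {
        private final StringBuilder xhtml = new StringBuilder("<table>");

        @Override
        public void startRow(int rowNum) {
            xhtml.append("<tr>");
        }

        @Override
        public void endRow() {
            xhtml.append("</tr>");
        }

        @Override
        public void cell(String cellReference, String formattedValue) {
            xhtml.append("<td>").append(escape(formattedValue)).append("</td>");
        }

        /** Closes the table and returns the markup; call once, after the last row. */
        public String finish() {
            return xhtml.append("</table>").toString();
        }

        // Escapes the characters that would otherwise break the markup.
        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }
}
```

A caller would drive the handler from the sheet parser, for example `startRow(0)`, then `cell("A1", "a<b")`, then `endRow()`, and read the result from `finish()`.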
I've done some more work in r1037753. We can now use XSSFEventBasedExcelExtractor, wire in our own way to get at the text, and get at comments and headers.