I'm integrating Apache Tika into my project, and I want to extract text from PowerPoint slides, in both PPT and PPTX formats.
I've noticed that with the PPT format, the slide notes are all aggregated at the end of the XML output, and there is no way to identify which note belongs to which slide.
I began looking at the code and found the following:
in HSLFExtractor.java on line 140
I would like to implement this part and contribute the change back.
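
To make the goal concrete, here is a rough sketch of the kind of output I have in mind: each slide's notes are written directly after that slide and tagged with its number, instead of being collected at the end. This is only an illustration, not Tika's or POI's actual code. The classes `SlideContent` and `SlideNotesSketch`, the `render` method, and the `slide-notes` / `data-slide` markup are all hypothetical names made up for this example, and the text is not XML-escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model (not a Tika or POI class): one slide's body text plus its notes.
final class SlideContent {
    final int number;
    final String text;
    final String notes; // null or empty when the slide has no notes

    SlideContent(int number, String text, String notes) {
        this.number = number;
        this.text = text;
        this.notes = notes;
    }
}

final class SlideNotesSketch {
    // Emits each slide followed immediately by its own notes, so the
    // slide-to-note association survives in the output.
    // Assumption: the markup below is invented for illustration only.
    static List<String> render(List<SlideContent> slides) {
        List<String> out = new ArrayList<>();
        for (SlideContent s : slides) {
            out.add("<div class=\"slide\" id=\"slide-" + s.number + "\">" + s.text + "</div>");
            if (s.notes != null && !s.notes.isEmpty()) {
                out.add("<div class=\"slide-notes\" data-slide=\"" + s.number + "\">"
                        + s.notes + "</div>");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<SlideContent> slides = new ArrayList<>();
        slides.add(new SlideContent(1, "Intro", "Greet the audience"));
        slides.add(new SlideContent(2, "Agenda", null)); // no notes on this slide
        for (String line : render(slides)) {
            System.out.println(line);
        }
    }
}
```

With output shaped like this, a consumer could match every note to its slide by the slide number, which is the part that is missing for PPT today.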