I thought about the Parser approach, but this doesn't really feel like a Parser. That is, many different things may be images or have embedded images (PDFs, actual images like JPG, embedded images in Word/PPT docs, etc.), so I want to take the MIME type and optionally feed the content to the OCR engine, which extracts the images and produces one or more items of text, giving me back something I can then pass along to the Parser.
So, for instance, in the case of a PPT with embedded images, you would:
- Detect PPT
- Extract/OCR Images
- Feed to PPT/POI Parser
- Obtain glory
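The steps above could be sketched roughly like this. This is a hypothetical shape, not real Tika API; `OcrEngine`, `Parser`, and `OcrPipeline` are all invented names for illustration, and detection is reduced to a MIME-type string check:

```java
import java.util.List;

// Hypothetical sketch of the detect -> OCR -> parse flow described above.
// None of these interfaces are real Tika classes.
interface OcrEngine {
    List<String> extractText(byte[] document);   // OCR text from embedded images
}

interface Parser {
    String parse(byte[] document);               // normal text extraction (e.g. POI for PPT)
}

class OcrPipeline {
    private final OcrEngine ocr;
    private final Parser parser;

    OcrPipeline(OcrEngine ocr, Parser parser) {
        this.ocr = ocr;
        this.parser = parser;
    }

    // Detection is assumed to have produced the MIME type already;
    // here we just decide whether OCR applies and combine the results.
    String extract(String mimeType, byte[] document) {
        StringBuilder out = new StringBuilder(parser.parse(document));
        // Only run OCR for types that can carry embedded images.
        if (mimeType.equals("application/vnd.ms-powerpoint")) {
            for (String s : ocr.extractText(document)) {
                out.append('\n').append(s);
            }
        }
        return out.toString();
    }
}
```

The point is just the ordering: OCR runs as a pre-step keyed off the detected type, and the ordinary Parser never has to know images were involved.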
In a generic sense, what's needed is a pipeline approach. That being said, I've already got one of those; I just want the library abstraction that Tika gives me so I can plug and play my OCR tool and get text out of it.
An alternative would be for Parsers of MIME types whose content can contain images to optionally take in an OCR engine and, as they do their parsing, look for images themselves.
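That alternative could look something like the sketch below: the parser holds an optional OCR engine and applies it to any images it encounters while parsing. Again, all names here (`OcrEngine`, `PptParser`) are illustrative, not real Tika or POI classes, and the embedded images are handed in directly rather than discovered by real parsing:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Sketch of the alternative design: a per-MIME-type parser that optionally
// accepts an OCR engine. Invented names, not Tika API.
interface OcrEngine {
    String recognize(byte[] image);   // OCR a single image to text
}

class PptParser {
    private final Optional<OcrEngine> ocr;

    PptParser(Optional<OcrEngine> ocr) {
        this.ocr = ocr;
    }

    // In a real parser the embedded images would be found during parsing;
    // here they are passed in to keep the sketch self-contained.
    String parse(byte[] slideText, List<byte[]> embeddedImages) {
        StringBuilder out = new StringBuilder(new String(slideText, StandardCharsets.UTF_8));
        if (ocr.isPresent()) {
            for (byte[] img : embeddedImages) {
                out.append('\n').append(ocr.get().recognize(img));
            }
        }
        return out.toString();
    }
}
```

The nice property of this design is that the OCR engine stays optional: with no engine configured, the parser behaves exactly as it does today.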
BTW, for JavaOCR, the main issue seems to be getting training data for image recognition. Tesseract, on the other hand, has a rich set of trained models out of the box, but is written in C++ (although it has Java wrappers).