Details
Type: New Feature
Status: Open
Priority: Minor
Resolution: Unresolved
Description
It would be nice to add a utility/pre-processor that accepts any document type (scanned PDF, image, Word, PDF, XLS, etc.), uses something like Apache Tika to detect the type automatically, runs OCR where needed, extracts the plain text, and then feeds it to the pipeline.
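To make the request concrete, here is a rough sketch of the detect-then-extract flow. It is not a proposed implementation: the magic-byte table is only a tiny stand-in for the detection a library like Apache Tika would do, and the names `detect_type`, `preprocess`, and `extract_plain_text` are hypothetical. Real extractors (PDF parsing, OCR for scans and images, Office formats) would be registered in the same mapping and are deliberately left out.

```python
from typing import Callable, Dict

# Hypothetical, deliberately small magic-byte table. A real detector such as
# Apache Tika covers far more formats and also inspects container contents.
MAGIC_PREFIXES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # .docx/.xlsx are ZIP containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/x-ole-storage"),  # legacy .doc/.xls
]

# An extractor takes the raw bytes of a document and returns its plain text.
Extractor = Callable[[bytes], str]


def detect_type(data: bytes) -> str:
    """Guess a media type from the leading bytes of the document."""
    for prefix, media_type in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return media_type
    # No known signature: treat valid UTF-8 as plain text, anything else as unknown.
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def extract_plain_text(data: bytes) -> str:
    """Trivial extractor for documents that are already plain text."""
    return data.decode("utf-8")


def preprocess(data: bytes, extractors: Dict[str, Extractor]) -> str:
    """Detect the document type, then dispatch to the matching extractor."""
    media_type = detect_type(data)
    extractor = extractors.get(media_type)
    if extractor is None:
        raise ValueError(f"no extractor registered for {media_type}")
    return extractor(data)


if __name__ == "__main__":
    # Only plain text is wired up here; PDF/OCR/Office extractors are assumed
    # to be supplied by whatever backend is eventually chosen.
    registry = {"text/plain": extract_plain_text}
    text = preprocess(b"An already-plain document.", registry)
    print(text)  # this string would then be handed to the pipeline
```

Keeping the extractors in a registry keyed by media type means an OCR step for scanned PDFs or images could be added later without changing the dispatch logic.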