Details
Description
Currently the recommended way to exclude certain types of files from being indexed is to map them to EmptyParser in the Tika config. However, looking at how Tika works, this does not avoid reading the binary even if the mimetype is provided as part of the metadata.
The Tika Detector tries to determine the mimetype by actually reading some bytes from the InputStream [1] before looking it up in the passed Metadata. This causes unnecessary IO when a large number of binaries are excluded.
We need to find a way to avoid any access to binary content that is not being indexed. One option is to expose a multi-value config property which takes a list of mimetypes to be excluded from indexing. If the mimeType provided as part of the JCR data is in that excluded list, then the call to Tika should be avoided.
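A minimal sketch of the proposed check, assuming the excluded list arrives as plain strings from the config property. The class name ExcludedMimeTypes and the method shouldSkip are hypothetical and not part of Oak or Tika. The caller is assumed to read the declared mimeType from the JCR data (for example the jcr:mimeType property) and to consult this check before opening the binary stream.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper: decides from the declared mimeType alone whether
 * text extraction should be skipped, so the binary is never opened.
 */
public class ExcludedMimeTypes {
    private final Set<String> excluded = new HashSet<>();

    /** mimeTypes would come from the proposed multi-value config property. */
    public ExcludedMimeTypes(String... mimeTypes) {
        for (String mimeType : mimeTypes) {
            excluded.add(normalize(mimeType));
        }
    }

    /** Drops parameters such as "; charset=UTF-8" and lower-cases the type. */
    static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns true when the declared mimeType is in the excluded list.
     * A missing mimeType returns false, so the normal extraction path
     * (including Tika detection) still runs for such binaries.
     */
    public boolean shouldSkip(String declaredMimeType) {
        if (declaredMimeType == null) {
            return false;
        }
        return excluded.contains(normalize(declaredMimeType));
    }
}
```

With a check like this, the indexer could return an empty extraction result whenever shouldSkip is true, without requesting the InputStream at all. Binaries without a declared mimeType would still go through the existing path.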
Attachments
Issue Links
- is related to OAK-5048 Upgrade to Tika 1.15 version (Closed)