Uploaded image for project: 'Jackrabbit Oak'
  1. Jackrabbit Oak
  2. OAK-2895

Avoid accessing binary content if the mimeType is excluded from indexing

    XMLWordPrintableJSON

    Details

      Description

      Currently the recommended way to exclude certain types of files from getting indexed is to add them to EmptyParser in Tika Config. However looking at how Tika works even if mimetype is provided as part metadata.

      Tika Detector try to determine the mimetype by actually reading some bytes from InputStream [1] before looking up from passed MetaData. This would cause unnecessary IO in case large number of binaries are excluded.

      We would need to look for way where any access to binary content which is not being indexed can be avoided. One option can to expose a multi value config property which takes a list of mimetypes to be excluded from indexing. If the mimeType provided as part of JCR data is part of that excluded list then call to Tika should be avoided

      [1] https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446

        Attachments

        1. OAK-2895.patch
          18 kB
          Chetan Mehrotra

          Issue Links

            Activity

              People

              • Assignee:
                chetanm Chetan Mehrotra
                Reporter:
                chetanm Chetan Mehrotra
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: