Jackrabbit Oak / OAK-2895

Avoid accessing binary content if the mimeType is excluded from indexing

Details

    Description

      Currently the recommended way to exclude certain types of files from being indexed is to map them to EmptyParser in the Tika config. However, looking at how Tika works, this does not avoid reading the binary even if the mimetype is provided as part of the metadata.

      The Tika Detector tries to determine the mimetype by actually reading some bytes from the InputStream [1] before looking it up in the passed Metadata. This causes unnecessary IO when a large number of binaries are excluded.

      We need a way to avoid any access to binary content that is not going to be indexed. One option is to expose a multi-valued config property that takes a list of mimetypes to be excluded from indexing. If the mimeType provided as part of the JCR data is in that excluded list, the call to Tika should be skipped entirely.

      [1] https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L446
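
      The proposed check can be sketched roughly as follows. This is only an illustration of the idea, not the code in the attached patch: the class name MimeTypeExclusionSketch, its methods, and the use of a Supplier to defer opening the stream are all hypothetical, and the Tika call is replaced by a placeholder that just reads the stream as UTF-8 text. The point is that the mimeType stored in JCR is compared against the configured list before the binary is opened at all.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: skip text extraction (and any binary IO) for excluded mimeTypes.
public class MimeTypeExclusionSketch {
    // Stands in for the proposed multi-valued config property.
    private final Set<String> excludedMimeTypes = new HashSet<>();

    public MimeTypeExclusionSketch(String... excluded) {
        for (String mimeType : excluded) {
            excludedMimeTypes.add(normalize(mimeType));
        }
    }

    // Lower-case the type and drop parameters such as "; charset=utf-8".
    static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public boolean isExcluded(String jcrMimeType) {
        return jcrMimeType != null && excludedMimeTypes.contains(normalize(jcrMimeType));
    }

    // The binary is passed as a Supplier so that it is only opened when needed.
    public String extractText(String jcrMimeType, Supplier<InputStream> binary) {
        if (isExcluded(jcrMimeType)) {
            return ""; // excluded: the stream is never opened, so no bytes are read
        }
        try (InputStream in = binary.get()) {
            // Placeholder for the real Tika parse call (assumption for this sketch).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      With such a check in place, a node whose stored mimeType is, say, application/zip would be skipped without Tika ever touching its InputStream, while a node with no mimeType or a non-excluded one would go through extraction as before.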

      Attachments

        1. OAK-2895.patch
          18 kB
          Chetan Mehrotra

            People

              chetanm Chetan Mehrotra
              chetanm Chetan Mehrotra
              Votes: 0
              Watchers: 3
