  Jackrabbit Oak / OAK-7996

Ability to disable automatic text extraction via configuration

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Won't Fix

    Description

      This issue is to discuss allowing a user to disable automatic text extraction from binary data via a configuration file.

      Currently, you can save a tika.config file inside an index definition, which overrides the default Tika configuration for that index. This approach can be used to disable automatic text extraction, but only for that one index.

      I'd like to be able to do this at a global level, not per index, via a configuration file instead. Somewhere inside the document maker code, we would then check whether text extraction has been disabled by configuration for the candidate binary before extracting anything.
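
      A minimal sketch of what such a check could look like, using only the JDK. Everything named here is hypothetical and made up for illustration: the class TextExtractionPolicy, the property key textExtraction.disabledMimeTypes, and the idea of keying the setting on MIME type are assumptions, not existing Oak APIs or settings.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical global policy object; not part of Oak's API. */
public final class TextExtractionPolicy {

    // Hypothetical property key, invented for this sketch.
    static final String DISABLED_TYPES_KEY = "textExtraction.disabledMimeTypes";

    private final Set<String> disabledTypes;

    private TextExtractionPolicy(Set<String> disabledTypes) {
        this.disabledTypes = disabledTypes;
    }

    /** Reads a comma-separated list of MIME types from a properties-style config. */
    public static TextExtractionPolicy load(Reader config) throws IOException {
        Properties props = new Properties();
        props.load(config);
        String raw = props.getProperty(DISABLED_TYPES_KEY, "");
        Set<String> types = Arrays.stream(raw.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
        return new TextExtractionPolicy(types);
    }

    /** True if extraction is switched off for this MIME type; "*" disables all types. */
    public boolean isExtractionDisabled(String mimeType) {
        if (disabledTypes.contains("*")) {
            return true;
        }
        return mimeType != null
                && disabledTypes.contains(mimeType.toLowerCase(Locale.ROOT));
    }
}
```

      Under these assumptions, the document maker would call isExtractionDisabled(mimeType) before handing a binary to Tika and skip extraction when it returns true. Since the policy is loaded from local configuration, it could differ between instances without touching any index definition.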

      The value in this approach is that two instances can be identical in terms of index definitions and differ only in local configuration. Separate index definitions don't have to be maintained. And if you want to change which files you extract text from, you don't have to refresh an index to make it happen.

          People

            mattvryan Matt Ryan
            mattvryan Matt Ryan
