Tika / TIKA-3627

OOXML parsing is not working as intended using multiple threads

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.1
    • Component/s: None
    • Labels: None

    Description

      In the latest version (2.2.0), parsing of OOXML files is broken when multiple threads are used. I compared the call stacks of 2.1.0 and 2.2.0 and concluded that the cause is this commit, at line 86 of OOXMLExtractorFactory.

      In version 2.1.0, `ExtractorFactory.setThreadPrefersEventExtractors(true)` is called in every `parse` call, which sets the thread-local property on every thread that parses. In version 2.2.0, the call is made only once, in the static block, so the property keeps its default value (false) on every thread other than the first one. Effectively, this breaks the parsing of macros in OOXML files.
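
The effect described above can be reproduced without Tika or POI. The following is a minimal sketch using only the JDK; `ThreadLocalStaticInitDemo`, `Factory`, and `PREFERS_EVENT` are hypothetical stand-ins, not names from either library. It only illustrates that a thread-local value set in a static initializer is visible on the initializing thread alone.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalStaticInitDemo {

    // Hypothetical stand-in for a per-thread preference flag; defaults to false.
    static final ThreadLocal<Boolean> PREFERS_EVENT =
            ThreadLocal.withInitial(() -> false);

    // Hypothetical stand-in for a factory that sets the flag once, in a static block.
    static class Factory {
        static {
            // Runs exactly once, on whichever thread first touches Factory.
            PREFERS_EVENT.set(true);
        }

        static boolean currentThreadPrefersEvent() {
            return PREFERS_EVENT.get();
        }
    }

    // Reads the flag from a freshly created worker thread.
    static boolean readOnNewThread() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = Factory::currentThreadPrefersEvent;
            return pool.submit(task).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // The main thread triggers the static block, so it sees the value set there.
        boolean first = Factory.currentThreadPrefersEvent();
        // A new thread was never assigned a value, so it falls back to the default.
        boolean other = readOnNewThread();
        System.out.println("initializing thread: " + first);
        System.out.println("other thread: " + other);
    }
}
```

Run as written, the initializing thread reports true and the worker thread reports false, which matches the behavior change between 2.1.0 and 2.2.0 described in this issue.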

      An easy workaround in version 2.2.0 is to call `ExtractorFactory.setAllThreadsPreferEventExtractors(true)` at some point before Tika is first used.
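
Why a single all-threads call helps can be sketched as follows. This is a JDK-only model built on an assumption that is not stated in the issue: that the all-threads setting acts as a process-wide override consulted before the thread-local value. All names (`GlobalPreferenceSketch`, `setAllThreadsPref`, `effectivePref`) are hypothetical and do not come from POI or Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GlobalPreferenceSketch {

    // Hypothetical per-thread flag, defaulting to false.
    private static final ThreadLocal<Boolean> threadPref =
            ThreadLocal.withInitial(() -> false);

    // Hypothetical process-wide override; null means "no override set".
    private static volatile Boolean allThreadsPref = null;

    static void setThreadPref(boolean value) {
        threadPref.set(value);
    }

    static void setAllThreadsPref(Boolean value) {
        allThreadsPref = value;
    }

    // Assumed lookup order: the global override wins, else the thread-local value.
    static boolean effectivePref() {
        Boolean global = allThreadsPref;
        return global != null ? global : threadPref.get();
    }

    // Reads the effective preference from a freshly created worker thread.
    static boolean readOnNewThread() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = GlobalPreferenceSketch::effectivePref;
            return pool.submit(task).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        setThreadPref(true); // affects only the calling thread
        System.out.println("worker, thread-local only: " + readOnNewThread());

        setAllThreadsPref(true); // visible from every thread
        System.out.println("worker, global override: " + readOnNewThread());
    }
}
```

Under that assumption, the override only has to be set once, for example during application startup, and no per-thread setup is needed afterwards.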

            People

              Assignee: Unassigned
              Reporter: Bernhard Geisberger (bgeisberger)