  1. Tika
  2. TIKA-3422

Excluding both WMFParser and EMFParser causes wmf instances NOT to appear at all

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.26
    • Fix Version/s: None
    • Component/s: core

    Description

      I was attempting to exclude embedded WMF and EMF files from being parsed, but I noticed that when I do so, only instances of EMF files are reported by Tika in the output returned from /rmeta/text.

      As an experiment, I created two tika-config.xml files. The first excludes only the WMFParser, and when my MS Word source doc is processed I see lines like this, as expected:

      "Content-Type":"image/wmf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.EmptyParser"]

      And these are the EMF files that were found and parsed by the EMFParser:

      "Content-Type":"image/emf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.EMFParser"]
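
      For reference, a config of the first kind (excluding only the WMFParser) would typically look something like the sketch below. The attached tika-config_no_wmf.xml is not reproduced here, so this is an assumption about its shape rather than its actual contents. The WMFParser package name is also assumed, by analogy with the EMFParser class name shown in the output above.

      ```xml
      <?xml version="1.0" encoding="UTF-8"?>
      <properties>
        <parsers>
          <!-- Use the default set of parsers, minus the excluded one -->
          <parser class="org.apache.tika.parser.DefaultParser">
            <!-- Assumed class name: same package as EMFParser -->
            <parser-exclude class="org.apache.tika.parser.microsoft.WMFParser"/>
          </parser>
        </parsers>
      </properties>
      ```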

      A problem arises, though, when I try to exclude both WMFParser and EMFParser. The WMF instances disappear entirely, and only EMF instances are shown as being handled by the EmptyParser:

      "Content-Type":"image/emf","X-Parsed-By":["org.apache.tika.parser.CompositeParser","org.apache.tika.parser.EmptyParser"]
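
      A config of the second kind, with both parsers excluded, might look roughly like this. Again, this is only a sketch and not the content of the attached tika-config_no_emf_or_wmf.xml; the WMFParser class name is assumed to live in the same package as EMFParser.

      ```xml
      <?xml version="1.0" encoding="UTF-8"?>
      <properties>
        <parsers>
          <parser class="org.apache.tika.parser.DefaultParser">
            <!-- Assumed class name: same package as EMFParser -->
            <parser-exclude class="org.apache.tika.parser.microsoft.WMFParser"/>
            <!-- Class name taken from the X-Parsed-By output in this report -->
            <parser-exclude class="org.apache.tika.parser.microsoft.EMFParser"/>
          </parser>
        </parsers>
      </properties>
      ```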

      I think that in the second case both types should be shown as being handled by the EmptyParser. I still want to know that the WMF files are in the container even though I'm not parsing them.

      P.S. For whatever reason, I can't upload the original Word doc that I'm testing with; Jira won't allow it.

      Attachments

        1. tika-config_no_wmf.xml
          0.5 kB
          Josh Burchard
        2. tika-config_no_emf_or_wmf.xml
          0.6 kB
          Josh Burchard

          People

            Assignee: Unassigned
            Reporter: Josh Burchard (jbhcl)
            Votes: 0
            Watchers: 2
