Tika / TIKA-2551

Tika Server uses HtmlParser for XML no matter what config is given, even if XML is disabled in Config

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.17
    • Fix Version/s: 2.0.0
    • Component/s: server
    • Labels: None

    Description

      For some reason, the Tika Server has this line in TikaResource.java:

      parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
      

      The upshot is that the Tika Server (and only the server) always uses the HtmlParser for XML files, no matter what is configured in the Tika Config. If you disable XML in the Tika Config, or assign it to a different parser, that setting is silently ignored.
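
To make the effect concrete, here is a minimal, self-contained sketch of why an unconditional `put` wins over whatever the configuration says. It does not use Tika's classes: the `Parser` interface, the `named` helper, the `effectiveParsers` method, and the string media types are all hypothetical stand-ins, and the code surrounding the real line in TikaResource.java is not shown in this report, so treat this as a simplified model rather than the server's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class HardcodedOverrideSketch {

    // Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
    interface Parser {
        String name();
    }

    // Hypothetical helper that creates a parser identified only by its name.
    static Parser named(String name) {
        return () -> name;
    }

    // Models the behaviour described in the report: start from the configured
    // mapping, then unconditionally register an HTML parser for XML.
    static Map<String, Parser> effectiveParsers(Map<String, Parser> configured) {
        Map<String, Parser> parsers = new HashMap<>(configured);
        // Stand-in for: parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
        parsers.put("application/xml", named("HtmlParser"));
        return parsers;
    }

    public static void main(String[] args) {
        // Case 1: the config assigns XML to a different parser.
        Map<String, Parser> reassigned = new HashMap<>();
        reassigned.put("application/xml", named("EmptyParser"));
        System.out.println(effectiveParsers(reassigned).get("application/xml").name());

        // Case 2: the config has no parser for XML at all (XML "disabled").
        Map<String, Parser> disabled = new HashMap<>();
        System.out.println(effectiveParsers(disabled).get("application/xml").name());
    }
}
```

In both cases the lookup for `application/xml` returns the HTML parser, because the hardcoded entry is written after the configured mapping has been read and nothing checks whether the configuration already made a choice.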

      To test, run the Tika Server with the TIKA-866-valid.xml test file from tika-core/src/test/resources/org/apache/tika/config, which uses the EmptyParser for everything. If you ask the server what parsers it has, it correctly reports none at http://localhost:9998/parsers. If you give it an XML file, you'd expect it to fall through to the fallback parser (or possibly the empty parser). Instead, it gets processed as HTML, which is completely unexpected!
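
The reproduction steps can be scripted with a small Java client. This is a sketch under stated assumptions: it presumes a Tika Server is already running on localhost:9998 with the test config loaded, that `/parsers` accepts `Accept: text/plain`, and that an XML body can be submitted with `PUT` to `/tika`. Only the `/parsers` URL comes from this report; the `/tika` endpoint, the headers, and the sample document are assumptions. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReproduceXmlParsing {

    // Server address taken from the report.
    static final String SERVER = "http://localhost:9998";

    // Request asking the server which parsers it has loaded.
    static HttpRequest listParsers() {
        return HttpRequest.newBuilder(URI.create(SERVER + "/parsers"))
                .header("Accept", "text/plain")
                .GET()
                .build();
    }

    // Request submitting an XML document for text extraction.
    // Assumption: the server accepts PUT on /tika for this purpose.
    static HttpRequest parseXml(String xml) {
        return HttpRequest.newBuilder(URI.create(SERVER + "/tika"))
                .header("Content-Type", "application/xml")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(xml))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        String xml = "<?xml version=\"1.0\"?><doc><p>hello</p></doc>";
        HttpRequest[] requests = {listParsers(), parseXml(xml)};
        for (HttpRequest request : requests) {
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(request.method() + " " + request.uri()
                        + " -> HTTP " + response.statusCode());
                System.out.println(response.body());
            } catch (IOException e) {
                // No server listening, or the connection failed.
                System.out.println("Could not reach " + request.uri() + ": " + e);
            }
        }
    }
}
```

Comparing the two responses is the point of the exercise: the first should show what the configuration loaded, and the second shows what the server actually did with an XML document.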

      Originally discovered via https://stackoverflow.com/questions/48391615/tell-tika-not-to-parse-xml

          People

            Assignee: Unassigned
            Reporter: Nick Burch
            Votes: 0
            Watchers: 3
