Tika / TIKA-2551

Tika Server uses HtmlParser for XML no matter what config is given, even if XML is disabled in the config

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.17
    • Fix Version/s: 2.0
    • Component/s: server
    • Labels:
      None

      Description

      For some reason, the Tika Server has this line in TikaResource.java:

      parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
      

      The upshot is that the Tika Server (and only the server) always uses the HtmlParser for XML files, no matter what is configured in the Tika Config. If you disable XML in the Tika Config, or assign it to a different parser, that setting is silently ignored.
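
To make the mechanism concrete, here is a minimal, self-contained sketch that models the parser table as a plain map of media type to parser name. It is not Tika code: the class and method names are hypothetical, and the strings only stand in for the real parser objects. It shows why an unconditional `put` wins over whatever the config declared, including the case where the config left XML out entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the behaviour described above; not the actual TikaResource code.
public class ParserOverrideSketch {

    /** Stand-in for a config that maps every type to EmptyParser. */
    static Map<String, String> everythingEmpty() {
        Map<String, String> parsers = new LinkedHashMap<>();
        parsers.put("application/xml", "EmptyParser");
        parsers.put("text/html", "EmptyParser");
        return parsers;
    }

    /** Mimics the server-side step: an unconditional put replaces (or adds) the XML entry. */
    static Map<String, String> withHardCodedOverride(Map<String, String> configured) {
        Map<String, String> effective = new LinkedHashMap<>(configured);
        effective.put("application/xml", "HtmlParser");
        return effective;
    }

    public static void main(String[] args) {
        // Case 1: the config assigns XML to a different parser.
        Map<String, String> configured = everythingEmpty();
        System.out.println("configured: " + configured.get("application/xml"));
        System.out.println("effective:  " + withHardCodedOverride(configured).get("application/xml"));

        // Case 2: the config has no XML entry at all (XML "disabled").
        Map<String, String> noXml = new LinkedHashMap<>();
        System.out.println("effective with XML disabled: "
                + withHardCodedOverride(noXml).get("application/xml"));
    }
}
```

In both cases the effective entry for `application/xml` ends up as the HTML parser, which matches the symptom reported here.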

      To test, run the Tika Server with the TIKA-866-valid.xml test file from tika-core/src/test/resources/org/apache/tika/config, which uses the EmptyParser for everything. If you ask the server which parsers it has at http://localhost:9998/parsers, it correctly reports none. If you give it an XML file, you'd expect it to fall through to the fallback parser (or possibly the empty parser). Instead, it gets processed as HTML, which is completely unexpected!
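
The check above could be scripted roughly as follows. This is a sketch under assumptions: it presumes a server is already running on localhost:9998 with the test config loaded, that `/parsers` answers a GET with a plain-text listing, and that `/tika` accepts a PUT of the document body. The class name and the sample XML are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical reproduction client; requires Java 11+ for java.net.http.
public class Tika2551Repro {

    static final String BASE = "http://localhost:9998";

    /** Request asking the server which parsers it believes it has. */
    static HttpRequest listParsers(String base) {
        return HttpRequest.newBuilder(URI.create(base + "/parsers"))
                .header("Accept", "text/plain")
                .GET()
                .build();
    }

    /** Request sending an XML document for parsing (assumed endpoint: PUT /tika). */
    static HttpRequest parseXml(String base, String xml) {
        return HttpRequest.newBuilder(URI.create(base + "/tika"))
                .header("Content-Type", "application/xml")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(xml))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        String xml = "<?xml version=\"1.0\"?><root><item>hello</item></root>";
        HttpRequest[] requests = { listParsers(BASE), parseXml(BASE, xml) };
        for (HttpRequest req : requests) {
            try {
                HttpResponse<String> resp =
                        client.send(req, HttpResponse.BodyHandlers.ofString());
                System.out.println(req.method() + " " + req.uri() + " -> " + resp.statusCode());
                System.out.println(resp.body());
            } catch (IOException e) {
                // Most likely no server is listening on the assumed address.
                System.out.println(req.method() + " " + req.uri() + " failed: " + e);
            }
        }
    }
}
```

If the parser listing is empty but the second response still contains text extracted from the XML, that would be consistent with the behaviour reported in this issue.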

      Originally discovered via https://stackoverflow.com/questions/48391615/tell-tika-not-to-parse-xml

            People

            • Assignee: Unassigned
            • Reporter: Nick Burch (gagravarr)