Description
For some reason, the Tika Server has this line in TikaResource.java:
parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
As a result, the Tika Server (and only the server) always uses the HtmlParser for XML files, no matter what is configured in the Tika Config. If you disable XML in the Tika Config, or assign it to a different parser, that setting is silently ignored.
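To illustrate why the configuration loses, here is a minimal sketch in plain Java. It is not the actual Tika Server code: the class name, the method names, and the use of strings in place of MediaType and Parser objects are all hypothetical stand-ins. It only models the general pattern of an unconditional put() applied after the configured parsers have been loaded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model only: strings stand in for MediaType and Parser.
public class ParserOverrideSketch {

    // Stand-in for a Tika Config that maps everything to the EmptyParser.
    static Map<String, String> fromConfig() {
        Map<String, String> parsers = new HashMap<>();
        parsers.put("application/xml", "EmptyParser");
        parsers.put("text/plain", "EmptyParser");
        return parsers;
    }

    // Stand-in for the server-side step: an unconditional put() that
    // replaces (or re-adds) the XML entry, whatever the config said.
    static Map<String, String> withServerOverride(Map<String, String> configured) {
        Map<String, String> parsers = new HashMap<>(configured);
        parsers.put("application/xml", "HtmlParser");
        return parsers;
    }

    public static void main(String[] args) {
        Map<String, String> effective = withServerOverride(fromConfig());
        System.out.println("application/xml -> " + effective.get("application/xml"));
        System.out.println("text/plain      -> " + effective.get("text/plain"));

        // Even a config with no XML entry at all ends up with one.
        Map<String, String> disabled = withServerOverride(new HashMap<>());
        System.out.println("XML disabled    -> " + disabled.get("application/xml"));
    }
}
```

In this model, both reassigning XML to another parser and removing it entirely are overwritten, which matches the behaviour described above.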
To reproduce, run the Tika Server with the TIKA-866-valid.xml test config from tika-core/src/test/resources/org/apache/tika/config, which uses the EmptyParser for everything. If you ask the server which parsers it has, it correctly reports none at http://localhost:9998/parsers. If you then send it an XML file, you would expect it to fall through to the fallback parser (or possibly the EmptyParser). Instead, the file gets processed as HTML, which is completely unexpected!
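The two checks above can be scripted roughly as follows. This is a sketch, assuming Java 11+ and a Tika Server already running on localhost:9998 with the test config loaded. The class name, method names, and sample XML body are made up for illustration, and the Accept and Content-Type headers are my assumption about what the server accepts.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical reproduction helper; not part of Tika.
public class TikaXmlRepro {
    static final String BASE = "http://localhost:9998";

    // GET /parsers: ask the server which parsers it believes it has.
    static HttpRequest listParsers() {
        return HttpRequest.newBuilder(URI.create(BASE + "/parsers"))
                .header("Accept", "text/plain")
                .GET()
                .build();
    }

    // PUT /tika: send an XML document and ask for the extracted text.
    static HttpRequest parseXml(String xml) {
        return HttpRequest.newBuilder(URI.create(BASE + "/tika"))
                .header("Content-Type", "application/xml")
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(xml))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse.BodyHandler<String> asText = HttpResponse.BodyHandlers.ofString();

        System.out.println("--- /parsers ---");
        System.out.println(client.send(listParsers(), asText).body());

        // Made-up sample document; any small XML file should do.
        String xml = "<?xml version=\"1.0\"?><root><item>hello</item></root>";
        System.out.println("--- /tika ---");
        System.out.println(client.send(parseXml(xml), asText).body());
    }
}
```

With the EmptyParser config you would expect the second response to be empty. If it contains the text of the XML elements, the HtmlParser has handled the file.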
Originally discovered via https://stackoverflow.com/questions/48391615/tell-tika-not-to-parse-xml