While working on TIKA-357, which addresses a similar problem for charset detection, I found a related issue in MIME type identification. Tika currently reads only the first MimeTypes#getMinLength() bytes of the input when sniffing the MIME type from magic headers. The example file attached from Ken Krugler shows that the current minimum length of 4 * 1024 bytes isn't enough. Extending it to 8K (8 * 1024 bytes) fixes detection for this file and should allow more MIME types to be detected at little overhead cost.
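
To illustrate why the prefix length matters, here is a minimal standalone sketch. It is not Tika's implementation: the class PrefixSniffSketch, the helpers readPrefix and containsMagic, the "<html" marker, and the 5000-byte offset are all hypothetical, chosen only to show a marker that falls outside a 4K window but inside an 8K one. The contents of the attached file are not reproduced here.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not Tika code: shows how the sniffing window size
// decides whether a marker that appears late in the stream is seen at all.
class PrefixSniffSketch {

    /** Reads up to maxLength bytes from the start of the stream, then rewinds it. */
    static byte[] readPrefix(InputStream in, int maxLength) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(maxLength);
        try {
            byte[] buf = new byte[maxLength];
            int total = 0;
            while (total < maxLength) {
                int n = in.read(buf, total, maxLength - total);
                if (n == -1) {
                    break; // stream is shorter than the window
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // leave the stream positioned at its start for later parsing
        }
    }

    /** Returns true if the magic byte sequence occurs anywhere in the prefix. */
    static boolean containsMagic(byte[] prefix, byte[] magic) {
        outer:
        for (int i = 0; i + magic.length <= prefix.length; i++) {
            for (int j = 0; j < magic.length; j++) {
                if (prefix[i + j] != magic[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Assumed test data: 6000 bytes of padding with a marker at offset 5000.
        byte[] magic = "<html".getBytes(StandardCharsets.US_ASCII);
        byte[] doc = new byte[6000];
        Arrays.fill(doc, (byte) ' ');
        System.arraycopy(magic, 0, doc, 5000, magic.length);

        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));
        System.out.println(containsMagic(readPrefix(in, 4 * 1024), magic)); // false
        System.out.println(containsMagic(readPrefix(in, 8 * 1024), magic)); // true
    }
}
```

The extra cost of the larger window in this sketch is one additional 4K of buffering per stream, which is in line with the "little overhead" expectation above.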