Details
Bug
Status: Resolved
Major
Resolution: Fixed
2.4.0
None
None
Description
While running regression tests as part of upgrading from Tika 1.x to Tika 2.4.0, I found that several hundred samples, out of a corpus of about 1M files of various types, are now detected as EML. The cause is the <match value="\nX-" type="string" offset="0:1024"/> rule added in TIKA-3687 inside the minShouldMatch="2" clause. Attached is a sample PNG file that triggers the false positive: besides the \nX- match, it also contains a \nDate: value in its first 1024 bytes, so two of the clause's rules match.
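To make the failure mode concrete, here is a minimal sketch of how a "two of N short strings within the first 1024 bytes" check can fire on binary data. It is not Tika's matcher: the class and method names are hypothetical, the pattern list contains only the two strings mentioned in this report, and the window handling is simplified to a plain substring search over the first 1024 bytes. The sample bytes are synthetic, not the attached file.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class MinShouldMatchSketch {
    // Only the two patterns named in this report; the real clause has more (assumption: simplified).
    static final List<String> PATTERNS = List.of("\nX-", "\nDate:");
    static final int WINDOW = 1024;
    static final int MIN_SHOULD_MATCH = 2;

    // Counts how many distinct patterns occur within the first WINDOW bytes.
    static int countMatchingPatterns(byte[] data) {
        int len = Math.min(data.length, WINDOW);
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives the conversion.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        int hits = 0;
        for (String pattern : PATTERNS) {
            if (head.contains(pattern)) {
                hits++;
            }
        }
        return hits;
    }

    // Simplified stand-in for the minShouldMatch="2" decision.
    static boolean looksLikeEml(byte[] data) {
        return countMatchingPatterns(data) >= MIN_SHOULD_MATCH;
    }

    // Synthetic bytes: a PNG signature followed by metadata-like text (hypothetical content).
    static byte[] fakePng() {
        String s = "\u0089PNG\r\n\u001a\n"
                + "tEXtComment junk\nDate: 2022-01-01 more junk\nX-Resolution: 72";
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("fake PNG flagged as EML: " + looksLikeEml(fakePng()));
    }
}
```

Short patterns such as \nX- carry very little information, so pairing them with another equally common header string does not do much to rule out binary formats that embed textual metadata.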
On an unrelated note, I tried to override the message/rfc822 mime definition with a custom-tika-mimetypes.xml on the classpath, but it had no effect. This used to work in Tika 1.x. Was that change intentional? I think user definitions should take precedence over Tika's built-in definitions, since the right definition can depend on the domain or context (e.g. the same extension may be used by different applications). If it wasn't intentional, I'll open another issue.