Description
I found some mails that begin with X- headers.
These mails are detected as text/plain.
One example is Cisco's IronPort, which may add an "X-IronPort-AV" header at the beginning of a mail.
I would therefore like to discuss whether and how Tika should handle these cases.
In my opinion, Tika should try to detect files that start with X- headers and preprocess them to obtain a valid mail.
Suggestion:
<mime-type type="text/x-tika-x-header">
  <magic priority="50">
    <match value="X-" type="string" offset="0">
      <match value="Message-ID:" type="string" offset="0:8192"/>
      <match value="From:" type="stringignorecase" offset="0:8192"/>
      <match value="To:" type="stringignorecase" offset="0:8192"/>
      <match value="Subject:" type="string" offset="0:8192"/>
      <match value="MIME-Version:" type="stringignorecase" offset="0:8192"/>
    </match>
  </magic>
  <sub-class-of type="text/x-tika-text-based-message"/>
</mime-type>
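
To make the intent of the suggested rule easier to discuss, here is a minimal stand-alone sketch of its logic in Python. It is not Tika code and does not use any Tika API. The function names (looks_like_x_header_mail, _found_within) and the constant MAX_OFFSET are hypothetical and chosen only for illustration. It assumes that the nested matches act as alternatives that are combined with the outer "X-" match, and that offset="0:8192" means the pattern may start anywhere from offset 0 to offset 8192.

```python
# Hypothetical stand-alone sketch of the proposed magic rule; not Tika's detector.
MAX_OFFSET = 8192  # upper bound of the offset range "0:8192" in the rule

# (pattern, ignore_case) pairs copied from the nested <match> elements.
HEADER_PATTERNS = [
    (b"Message-ID:", False),
    (b"From:", True),
    (b"To:", True),
    (b"Subject:", False),
    (b"MIME-Version:", True),
]


def _found_within(data, pattern, ignore_case, max_offset=MAX_OFFSET):
    """Return True if pattern starts at some offset in 0..max_offset."""
    # A match starting at max_offset still needs len(pattern) bytes after it.
    window = data[: max_offset + len(pattern)]
    if ignore_case:
        window, pattern = window.lower(), pattern.lower()
    return pattern in window


def looks_like_x_header_mail(data):
    """Outer match: data starts with "X-". Inner matches: any header is found."""
    if not data.startswith(b"X-"):
        return False
    return any(_found_within(data, p, ic) for p, ic in HEADER_PATTERNS)
```

For example, calling looks_like_x_header_mail on the bytes of a mail that starts with "X-IronPort-AV: ..." followed by a "From:" line returns True, while plain text that merely contains "From:" somewhere but does not start with "X-" returns False.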
See also: RFC 6648
An example file is attached.
Regards
Andreas