Description
This is a spin-off from TIKA-2578.
I have mbox files that are not being recognized as such because they have X- headers at the top.
Current config:
<mime-type type="application/mbox">
  <!-- MBOX files start with "From [sender] [date]" -->
  <!-- To avoid false matches, check for other headers after that -->
  <magic priority="70">
    <match value="From " type="string" offset="0">
      <match value="\nFrom: " type="string" offset="32:256"/>
      <match value="\nDate: " type="string" offset="32:256"/>
      <match value="\nSubject: " type="string" offset="32:256"/>
      <match value="\nDelivered-To: " type="string" offset="32:256"/>
      <match value="\nReceived: by " type="string" offset="32:256"/>
      <match value="\nReceived: via " type="string" offset="32:256"/>
      <match value="\nReceived: from " type="string" offset="32:256"/>
      <match value="\nMime-Version: " type="string" offset="32:256"/>
    </match>
mbox file:
From "naveen.andrews@enron.com" Wed Jan 30 18:07:01 2002
X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)
X-EDO-AED-Version: 1.0
X-EDO-AED-License: Creative Commons Attribution 3.0 United States; http://creativecommons.org/licenses/by/3.0/us/; To provide attribution, please cite to "EnronData.org."
X-EDO-AED-ID: 516172
X-EDO-AED-File: zipper-a/inbox/38.eml
Message-ID: <8269158.1075842014924.JavaMail.evans@thyme>
Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)
From: naveen.andrews@enron.com
To: andy.zipper@enron.com
Subject: RE: Var simulation ...
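The MBOX rule looks for the additional headers only at offsets 32 to 256, which is not enough when X- headers are present. In the file above, the X-EDO-* headers and Message-ID (which the rule does not check) come first, so the first header the rule knows about (Date:) starts well past byte 256.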
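To make the offset problem concrete, here is a minimal Python sketch that mimics the magic rule against the sample above. It is not Tika's matcher: the helper names `matches_mbox_rule` and `first_header_offset` are hypothetical, the sketch assumes `offset="32:256"` means the pattern may start at any byte offset from 32 to 256 inclusive, and the wider limit of 1024 is an arbitrary value chosen for illustration, not a proposed setting.

```python
# Secondary patterns copied from the current application/mbox magic rule.
SECONDARY = [
    b"\nFrom: ", b"\nDate: ", b"\nSubject: ", b"\nDelivered-To: ",
    b"\nReceived: by ", b"\nReceived: via ", b"\nReceived: from ",
    b"\nMime-Version: ",
]

# The sample mbox header block from this issue, one header per line.
SAMPLE = b"\n".join([
    b'From "naveen.andrews@enron.com" Wed Jan 30 18:07:01 2002',
    b"X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)",
    b"X-EDO-AED-Version: 1.0",
    b"X-EDO-AED-License: Creative Commons Attribution 3.0 United States; "
    b"http://creativecommons.org/licenses/by/3.0/us/; "
    b'To provide attribution, please cite to "EnronData.org."',
    b"X-EDO-AED-ID: 516172",
    b"X-EDO-AED-File: zipper-a/inbox/38.eml",
    b"Message-ID: <8269158.1075842014924.JavaMail.evans@thyme>",
    b"Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)",
    b"From: naveen.andrews@enron.com",
    b"To: andy.zipper@enron.com",
    b"Subject: RE: Var simulation",
]) + b"\n"


def first_header_offset(data: bytes) -> int:
    """Return the smallest offset at which any secondary pattern starts, or -1."""
    found = [i for i in (data.find(p) for p in SECONDARY) if i >= 0]
    return min(found) if found else -1


def matches_mbox_rule(data: bytes, start: int = 32, end: int = 256) -> bool:
    """Approximate the rule: 'From ' at offset 0, plus one secondary pattern
    starting somewhere in [start, end] (assumed inclusive)."""
    if not data.startswith(b"From "):
        return False
    # bytes.find only reports matches lying wholly inside the slice, so the
    # slice end is extended by the pattern length to allow a start at `end`.
    return any(data.find(p, start, end + len(p)) != -1 for p in SECONDARY)


if __name__ == "__main__":
    print("first known header starts at byte", first_header_offset(SAMPLE))
    print("matches with 32:256: ", matches_mbox_rule(SAMPLE))
    print("matches with 32:1024:", matches_mbox_rule(SAMPLE, end=1024))
```

Under these assumptions the sample fails the rule with the current window and passes once the window is wide enough to reach the Date: header, which is the behaviour described in this report.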
Side note: prior to 1.17, such an mbox file was detected as text/plain. As of 1.17 it is detected as message/rfc822 (because TIKA-2594 added a rule that matches Message-ID within the first 1000 bytes). Neither is correct!
Issue Links
- relates to TIKA-2578 Mails not recognized when unknown X-headers are present (Resolved)