TIKA-2688

MBOX not recognized when unknown X-headers are present

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.18
    • Fix Version/s: 1.19, 2.0.0
    • Component/s: detector, mime
    • Labels: None

      Description

      This is a spin-off from TIKA-2578.

      I have mbox files that are not being recognized as such because they have X- headers at the top.

      Current config:

        <mime-type type="application/mbox">
          <!-- MBOX files start with "From [sender] [date]" -->
          <!-- To avoid false matches, check for other headers after that -->
          <magic priority="70">
            <match value="From " type="string" offset="0">
              <match value="\nFrom: " type="string" offset="32:256"/>
              <match value="\nDate: " type="string" offset="32:256"/>
              <match value="\nSubject: " type="string" offset="32:256"/>
              <match value="\nDelivered-To: " type="string" offset="32:256"/>
              <match value="\nReceived: by " type="string" offset="32:256"/>
              <match value="\nReceived: via " type="string" offset="32:256"/>
              <match value="\nReceived: from " type="string" offset="32:256"/>
              <match value="\nMime-Version: " type="string" offset="32:256"/>
            </match>
          </magic>
          <!-- rest of the definition omitted -->
        </mime-type>
      

      mbox file:

      From "naveen.andrews@enron.com" Wed Jan 30 18:07:01 2002
      X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)
      X-EDO-AED-Version: 1.0
      X-EDO-AED-License: Creative Commons Attribution 3.0 United States;
       http://creativecommons.org/licenses/by/3.0/us/;
       To provide attribution, please cite to "EnronData.org."
      X-EDO-AED-ID: 516172
      X-EDO-AED-File: zipper-a/inbox/38.eml
      Message-ID: <8269158.1075842014924.JavaMail.evans@thyme>
      Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)
      From: naveen.andrews@enron.com
      To: andy.zipper@enron.com
      Subject: RE: Var simulation
      ...
      

      The MBOX rule looks for the additional headers only within the first 256 bytes (offset="32:256"). That is not enough when X- headers come first and push the standard headers past that window.
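
To make the offset problem concrete, here is a rough Python sketch (not Tika code) that searches the sample above for each child match value of the quoted rule and checks whether any occurrence starts inside the 32:256 window. The helper names `first_offset` and `matches_in_window` are made up for illustration, and treating the offset range as an inclusive range of match start positions is an assumption about how the rule is evaluated.

```python
# Sample from this report with the display indentation removed.
# The two continuation lines of X-EDO-AED-License keep their single leading space.
SAMPLE = (
    'From "naveen.andrews@enron.com" Wed Jan 30 18:07:01 2002\n'
    'X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)\n'
    'X-EDO-AED-Version: 1.0\n'
    'X-EDO-AED-License: Creative Commons Attribution 3.0 United States;\n'
    ' http://creativecommons.org/licenses/by/3.0/us/;\n'
    ' To provide attribution, please cite to "EnronData.org."\n'
    'X-EDO-AED-ID: 516172\n'
    'X-EDO-AED-File: zipper-a/inbox/38.eml\n'
    'Message-ID: <8269158.1075842014924.JavaMail.evans@thyme>\n'
    'Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)\n'
    'From: naveen.andrews@enron.com\n'
    'To: andy.zipper@enron.com\n'
    'Subject: RE: Var simulation\n'
).encode("ascii")

# Child match values copied from the application/mbox rule quoted above.
HEADERS = [
    b"\nFrom: ", b"\nDate: ", b"\nSubject: ", b"\nDelivered-To: ",
    b"\nReceived: by ", b"\nReceived: via ", b"\nReceived: from ",
    b"\nMime-Version: ",
]


def first_offset(data, pattern):
    """Offset of the first occurrence of pattern in data, or -1 if absent."""
    return data.find(pattern)


def matches_in_window(data, patterns, start, end):
    """Patterns with an occurrence starting at an offset in [start, end].

    Assumption: the window bounds the position where the match begins,
    and both ends are inclusive.
    """
    return [p for p in patterns if data.find(p, start, end + len(p)) != -1]


if __name__ == "__main__":
    for pattern in HEADERS:
        print(repr(pattern), first_offset(SAMPLE, pattern))
    print("inside 32:256:", matches_in_window(SAMPLE, HEADERS, 32, 256))
```

On this sample none of the listed headers starts inside the window, because the X-EDO-* lines alone take up more than 256 bytes. Under the same assumptions, calling `matches_in_window` with a larger upper bound shows how far the window would need to reach for a file like this one.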

      Side note: prior to 1.17, such an mbox file was detected as text/plain. As of 1.17 it is detected as message/rfc822 (due to TIKA-2594, which added a rule for Message-ID being present in the first 1000 bytes). Neither is correct!

              People

              • Assignee: Tim Allison (tallison)
              • Reporter: Yury Kats (yurykats)