Tika / TIKA-2578

Mails not recognized when unknown X-headers are present

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.17, 1.18, 2.0.0
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: detector, mime
    • Labels: None

    Description

      I found some mails that begin with X-headers.

      Tika detects these mails as text/plain rather than as message/rfc822.

      One example is Cisco's IronPort, which may add an "X-IronPort-AV" header at the beginning of a mail.

      I would therefore like to discuss whether and how Tika should handle these cases.

      In my opinion, Tika should try to detect files with leading X-headers and preprocess them to get a valid mail.
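
The report does not say what "preprocess" would involve. One possible reading is to skip the leading X-header lines so that detection starts at the first standard header. The sketch below illustrates that reading only; the class and method names are hypothetical, it is not Tika code, and dropping headers would lose information that a real parser may want to keep.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: drops X-headers at the very start of a message.
final class LeadingXHeaderStripper {

    static String stripLeadingXHeaders(String message) {
        // Keep trailing empty strings so the header/body separator survives.
        String[] lines = message.split("\r?\n", -1);
        int i = 0;
        // Consume each leading line whose field name starts with "X-" (case-insensitive).
        while (i < lines.length && lines[i].regionMatches(true, 0, "X-", 0, 2)) {
            i++;
            // Also consume folded continuation lines, which begin with a space or tab.
            while (i < lines.length && !lines[i].isEmpty()
                    && (lines[i].charAt(0) == ' ' || lines[i].charAt(0) == '\t')) {
                i++;
            }
        }
        // Note: line endings are normalized to CRLF in the result.
        return String.join("\r\n", Arrays.asList(lines).subList(i, lines.length));
    }
}
```

Only X-headers before the first other header are removed; X-headers that appear later in the header block are left untouched.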

      Suggestion:

      <mime-type type="text/x-tika-x-header">
        <magic priority="50">
          <match value="X-" type="string" offset="0">
            <match value="Message-ID:" type="string" offset="0:8192"/>
            <match value="From:" type="stringignorecase" offset="0:8192"/>
            <match value="To:" type="stringignorecase" offset="0:8192"/>
            <match value="Subject:" type="string" offset="0:8192"/>
            <match value="MIME-Version:" type="stringignorecase" offset="0:8192"/>
          </match>
        </magic>
        <sub-class-of type="text/x-tika-text-based-message"/>
      </mime-type>
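
To make the intent of the proposed magic rule easier to discuss, here is a minimal stdlib-only Java sketch of the check it appears to describe: the data must start with "X-", and at least one of the listed headers must begin within the first 8192 bytes. It assumes that nested matches combine with their parent as AND and that sibling matches combine as OR. The class and method names are hypothetical, and this is not how Tika evaluates magic internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the proposed rule, not Tika's magic matcher.
final class XHeaderMagicSketch {

    // Last start offset searched, mirroring offset="0:8192" in the proposal.
    private static final int MAX_START = 8192;

    /** True if pattern begins in data at some offset in [0, maxStart]. */
    static boolean matchesWithin(byte[] data, String pattern, int maxStart, boolean ignoreCase) {
        byte[] p = pattern.getBytes(StandardCharsets.US_ASCII);
        int last = Math.min(maxStart, data.length - p.length);
        for (int start = 0; start <= last; start++) {
            if (regionEquals(data, start, p, ignoreCase)) {
                return true;
            }
        }
        return false;
    }

    private static boolean regionEquals(byte[] data, int start, byte[] p, boolean ignoreCase) {
        for (int i = 0; i < p.length; i++) {
            int a = data[start + i] & 0xFF;
            int b = p[i] & 0xFF;
            if (ignoreCase) {
                a = Character.toLowerCase(a);
                b = Character.toLowerCase(b);
            }
            if (a != b) {
                return false;
            }
        }
        return true;
    }

    static boolean looksLikeXHeaderMail(byte[] data) {
        // Outer match: "X-" at offset 0, case-sensitive (type="string").
        if (!matchesWithin(data, "X-", 0, false)) {
            return false;
        }
        // Inner matches: any one header is enough; case handling follows each match type.
        return matchesWithin(data, "Message-ID:", MAX_START, false)
                || matchesWithin(data, "From:", MAX_START, true)
                || matchesWithin(data, "To:", MAX_START, true)
                || matchesWithin(data, "Subject:", MAX_START, false)
                || matchesWithin(data, "MIME-Version:", MAX_START, true);
    }
}
```

Because "Message-ID:" and "Subject:" use type="string" in the proposal, the sketch treats them as case-sensitive, so a file whose only standard header is a lowercase "subject:" would not match.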
      

      See also: RFC 6648 (deprecation of the "X-" prefix)

      An example file is attached.

      Regards

      Andreas

      Attachments

        1. testRFC822_with_leading_x_header
          2 kB
          Andreas Meier

            People

              Assignee: Tim Allison (tallison)
              Reporter: Andreas Meier (AndreasMeier)
              Votes: 0
              Watchers: 5
