Tika / TIKA-2471

Tab-prefixed message body lines in Mbox interpreted as headers

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

      Description

      The mbox parser is overly optimistic. It scans the entire message, body included, for anything that matches a header pattern, wherever the match occurs in a line. As a result, tab-prefixed lines in the message body are interpreted as headers.
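
To illustrate the kind of behaviour described above, here is a minimal sketch. It is not the actual MboxParser code: the `HEADER` pattern, the class name, and the helper methods are all hypothetical and made up for this example. It only shows how an unanchored `find()` accepts a tab-prefixed body line that an anchored `matches()` rejects, and how stopping at the first blank line keeps body lines out of header detection altogether.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class HeaderMatchSketch {
    // Hypothetical "Name: value" pattern, invented for this sketch (not taken from Tika).
    static final Pattern HEADER =
            Pattern.compile("([A-Za-z][A-Za-z0-9-]*):[ \\t]*(.*)");

    // Lenient check: find() succeeds if the pattern occurs anywhere in the line.
    static boolean looksLikeHeaderLenient(String line) {
        return HEADER.matcher(line).find();
    }

    // Strict check: matches() requires the whole line to match,
    // so the header name has to start in the first column.
    static boolean looksLikeHeaderStrict(String line) {
        return HEADER.matcher(line).matches();
    }

    // Collects header lines only up to the first blank line, which
    // separates headers from the body in an RFC 5322 message.
    static List<String> headerLines(List<String> messageLines) {
        List<String> headers = new ArrayList<>();
        for (String line : messageLines) {
            if (line.isEmpty()) {
                break; // everything after this is body text
            }
            if (looksLikeHeaderStrict(line)) {
                headers.add(line);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String bodyLine = "\tNote: call me back";
        System.out.println("lenient: " + looksLikeHeaderLenient(bodyLine));
        System.out.println("strict:  " + looksLikeHeaderStrict(bodyLine));

        List<String> message = Arrays.asList(
                "From: a@example.com",
                "Subject: hi",
                "",
                bodyLine,
                "Date: this is body text");
        System.out.println("headers: " + headerLines(message));
    }
}
```

Anchoring the match alone would still misread an unindented body line such as `Date: this is body text`, which is why the sketch also stops scanning at the blank line that ends the header block.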

      It looks to me like the parsing logic is in desperate need of a refactor. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?

      Also, out of curiosity, why does the parser force Windows-1252 as the charset?

        Attachments

        1. mbox
          8 kB
          Matthew Caruana Galizia

              People

              • Assignee: Unassigned
              • Reporter: Matthew Caruana Galizia (mcaruanagalizia)
              • Votes: 0
              • Watchers: 4
