Tika / TIKA-667

Changes to RFC822Parser to support turning off strict parsing

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.10
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels: None

    Description

      Currently, if Apache Mime4J fails while parsing any field in RFC822Parser, parsing of the whole document fails. This causes problems on the Enron Corpus; see https://issues.apache.org/jira/browse/TIKA-657

      RFC822Parser is configured from a MimeEntityConfig object, which contains an option for "strict parsing". Currently MailContentHandler only performs strict parsing, i.e. if a MimeException is thrown while processing any field in MailContentHandler.field(), processing of the document fails. However, we may prefer non-strict parsing, i.e. to continue even if processing one or more fields fails. This can be achieved by placing a try / catch block around the logic inside MailContentHandler.field() and rethrowing the exception only if strict parsing is enabled; otherwise the error is logged and parsing continues.
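
      The try / catch approach described above might look roughly like the following. This is only a self-contained sketch, not the code in the attached diff: FieldParseException, LenientFieldHandler and parseField are hypothetical stand-ins for MimeException, MailContentHandler and the real field-handling logic, and the "skipped" list stands in for logging.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Mime4J's MimeException.
class FieldParseException extends Exception {
    FieldParseException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for MailContentHandler.
class LenientFieldHandler {
    private final boolean strictParsing;
    final List<String> parsed = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();

    LenientFieldHandler(boolean strictParsing) {
        this.strictParsing = strictParsing;
    }

    // Stand-in for the real per-field logic: fails on an empty field body.
    private String parseField(String name, String body) throws FieldParseException {
        if (body == null || body.trim().isEmpty()) {
            throw new FieldParseException("Empty body for field " + name);
        }
        return name + "=" + body.trim();
    }

    // Wraps the per-field logic so that one bad field need not fail the document.
    void field(String name, String body) throws FieldParseException {
        try {
            parsed.add(parseField(name, body));
        } catch (FieldParseException e) {
            if (strictParsing) {
                throw e; // strict mode: the whole document fails, as before
            }
            skipped.add(name); // lenient mode: record (in practice, log) and continue
        }
    }
}
```

      With strict parsing off, a malformed field is recorded and the remaining fields are still processed; with it on, the behaviour is unchanged from the current handler.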

      I enclose a diff for RFC822Parser and MailContentHandler that does this. I have also made some other minor changes to MailContentHandler: the code for handling the To:, Cc: and Bcc: fields was repeated, so I have replaced it with a single private method, and I have rewritten stripOutFieldPrefix to avoid manipulating the String through repeated re-assignment.
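
      To illustrate the kind of refactoring described, here is a minimal sketch. It is not taken from mailparser.diff: the class HeaderFieldHelper, the method addAddressField, the two-argument signature of stripOutFieldPrefix and the metadata keys are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; names and signatures are assumptions, not the patch.
class HeaderFieldHelper {
    final Map<String, String> metadata = new LinkedHashMap<>();

    // Case-insensitive check that the raw field starts with the given prefix.
    private static boolean hasPrefix(String rawField, String prefix) {
        return rawField.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    // Returns the field value without its "Name:" prefix, computed in a
    // single expression rather than by repeatedly re-assigning a String.
    static String stripOutFieldPrefix(String rawField, String prefix) {
        return hasPrefix(rawField, prefix)
                ? rawField.substring(prefix.length()).trim()
                : rawField.trim();
    }

    // One private method shared by the To:, Cc: and Bcc: branches.
    private void addAddressField(String metadataKey, String prefix, String rawField) {
        metadata.put(metadataKey, stripOutFieldPrefix(rawField, prefix));
    }

    void field(String rawField) {
        if (hasPrefix(rawField, "To:")) {
            addAddressField("To", "To:", rawField);
        } else if (hasPrefix(rawField, "Cc:")) {
            addAddressField("Cc", "Cc:", rawField);
        } else if (hasPrefix(rawField, "Bcc:")) {
            addAddressField("Bcc", "Bcc:", rawField);
        }
    }
}
```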

      Attachments

        1. mailparser.diff
          9 kB
          Mark Butler

          People

            jukkaz Jukka Zitting
            butlermh Mark Butler