TIKA-657

Email parser gets into trouble on malformed html in enron corpus

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      There is a very large corpus of email messages available: http://www.cs.cmu.edu/~enron/.

      In processing even a subset of this corpus, I see numerous 'unexpected RuntimeException' errors resulting from TagSoup throwing on truly awful HTML. It seems to me that being able to process this entire corpus would make a good '1.0' criterion for Tika's email parser.

        Activity

        Jukka Zitting added a comment -

        I was able to process the entire Enron corpus without problems, so resolving as fixed.

        Jukka Zitting added a comment -

        In revision 1183109 I increased the default line and header length limits to cope with valid messages in the Enron corpus. With that change I saw no more exceptions, at least in the first 30k messages.

        I'll run a full test over the entire corpus to see if there are any other problems left before we can resolve this issue.

        Mark Butler added a comment -

        I have submitted code to support turning off strict parsing as issue https://issues.apache.org/jira/browse/TIKA-667

        Mark Butler added a comment -

        Summary of exceptions thrown when processing Enron corpus using Tika 1.0 snapshot (head)

        Mark Butler added a comment -

        Summary of exceptions thrown when processing Enron corpus using Tika 0.9

        Mark Butler added a comment -

        I took the Enron dataset and processed it using Tika and Behemoth. It contains 517,424 documents.

        Using Tika 0.9 I encountered runtime errors on 27,224 documents. Sorting the exceptions showed four different stack traces. I enclose a summary of these exceptions below. However, I did not see the problems with TagSoup parsing that Benson reports.

        I then took the version of Tika in head. Here I encountered runtime errors on 1,218 documents. I enclose a summary of these exceptions below as well. There were two sources of error. First, the Enron corpus contains emails with lines longer than the default limit of 10,000 characters used in RFC822Parser. Second, the corpus contains malformed dates, which cause apache-mime4j to throw a MimeException.
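
        To illustrate the first source of error, here is a minimal sketch of how a configurable maximum line length turns an over-long line into a failure. It is not the mime4j or Tika implementation: LineLimitSketch, LineTooLongException, countLines and repeat are hypothetical names made up for this example, and a real parser would enforce the limit while reading rather than after splitting the whole message.

```java
// Hypothetical stand-in, not the mime4j or Tika API. It only illustrates
// how a maximum line length makes parsing fail on over-long lines.
final class LineLimitSketch {

    /** Unchecked, hypothetical exception for a line over the limit. */
    static final class LineTooLongException extends RuntimeException {
        LineTooLongException(int lineNumber, int length, int limit) {
            super("line " + lineNumber + " has " + length
                    + " characters, limit is " + limit);
        }
    }

    /** Counts the lines of a message, failing on the first line over the limit. */
    static int countLines(String message, int maxLineLen) {
        int count = 0;
        for (String line : message.split("\r?\n")) {
            count++;
            if (line.length() > maxLineLen) {
                throw new LineTooLongException(count, line.length(), maxLineLen);
            }
        }
        return count;
    }

    /** Builds a string of n copies of c. */
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A two-line message whose second line is 25,000 characters long.
        String message = "Subject: test\n" + repeat('x', 25000) + "\n";
        try {
            countLines(message, 10000);
        } catch (LineTooLongException e) {
            System.out.println("limit 10000: " + e.getMessage());
        }
        // With a higher limit the same message is accepted.
        System.out.println("limit 60000: " + countLines(message, 60000) + " lines");
    }
}
```

        With the lower limit the 25,000-character line is rejected; with the higher limit the same message goes through unchanged.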

        The first problem is easily fixed: RFC822Parser is configured from a MimeEntityConfig object, so passing in an object with a higher MaxLineLen (e.g. 60,000) avoids these exceptions. I noticed that MimeEntityConfig also contains an option for "strict parsing". Currently MailContentHandler only performs strict parsing, i.e. if a MimeException is thrown while processing any field in MailContentHandler.field(), it is passed back up and processing of the document fails. However, we may prefer non-strict parsing, i.e. to continue even if processing one or more fields fails. This can be achieved by placing a try/catch block around the logic inside MailContentHandler.field() and rethrowing the exception only if strictParsing is enabled; otherwise we log the error.

        I then re-ran this on the entire corpus and it parsed successfully.
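
        The try/catch idea described above can be sketched roughly as follows. This is not the actual patch and does not use the Tika or mime4j classes: FieldHandlerSketch stands in for MailContentHandler and FieldException for MimeException (made unchecked here to keep the example short), the date check is a toy, and failures are collected in a list instead of being written to a log.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of this is Tika or mime4j code.
final class FieldHandlerSketch {

    /** Plays the role of MimeException (unchecked here for brevity). */
    static final class FieldException extends RuntimeException {
        FieldException(String message) {
            super(message);
        }
    }

    private final boolean strictParsing;
    private final Map<String, String> metadata = new LinkedHashMap<>();
    private final List<String> skipped = new ArrayList<>();

    FieldHandlerSketch(boolean strictParsing) {
        this.strictParsing = strictParsing;
    }

    /** Handles one header field; a bad field is fatal only in strict mode. */
    void field(String name, String value) {
        try {
            metadata.put(name, parseField(name, value));
        } catch (FieldException e) {
            if (strictParsing) {
                throw e; // strict: the whole document fails
            }
            // lenient: record the problem and carry on with the next field
            skipped.add(name + ": " + e.getMessage());
        }
    }

    /** Toy validation: a Date field must contain a four-digit year. */
    private static String parseField(String name, String value) {
        if (name.equalsIgnoreCase("Date") && !value.matches(".*\\d{4}.*")) {
            throw new FieldException("malformed date '" + value + "'");
        }
        return value.trim();
    }

    Map<String, String> metadata() {
        return metadata;
    }

    List<String> skipped() {
        return skipped;
    }

    public static void main(String[] args) {
        FieldHandlerSketch lenient = new FieldHandlerSketch(false);
        lenient.field("Subject", "Quarterly numbers");
        lenient.field("Date", "not a date");
        System.out.println("metadata: " + lenient.metadata());
        System.out.println("skipped:  " + lenient.skipped());
    }
}
```

        In lenient mode the malformed Date field is dropped and the remaining fields are still extracted; constructing the handler with strictParsing set to true makes the same input fail, which matches the current behaviour.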

        Julien Nioche added a comment -

        Good idea. We need more tutorials and examples for Behemoth (https://github.com/jnioche/behemoth), and processing the Enron corpus with Tika would be an interesting one. We get the stack traces in the Hadoop logs and could then look into the details of each problem.


          People

          • Assignee: Julien Nioche
          • Reporter: Benson Margulies