I took the Enron dataset and processed it using Tika and Behemoth. It contains 517,424 documents.
Using Tika 0.9, I encountered runtime errors on 27,224 documents. Sorting the exceptions showed four distinct stack traces; I enclose a summary of them below. However, I did not see the problems with Tagsoup parsing that Benson reports.
I then took the version of Tika at head. Here I encountered runtime errors on 1,218 documents; a summary of these exceptions is also enclosed below. There were two sources of error. First, the Enron corpus contains emails with lines longer than the default limit of 10,000 characters used in RFC822Parser. Second, the corpus contains malformed dates, which cause apache-mime4j to throw a MimeException.
The first problem is easily fixed because RFC822Parser is configured from a MimeEntityConfig object, so passing in an object with a higher MaxLineLen (e.g. 60,000) avoids these exceptions.
I noticed that MimeEntityConfig also contains an option for "strict parsing". Currently MailContentHandler always behaves strictly: if a MimeException is thrown while processing any field in MailContentHandler.field(), it is passed back up and processing of the whole document fails. However, we may prefer non-strict parsing, i.e. to continue even if processing one or more fields fails. This can be achieved by placing a try / catch block around the logic inside MailContentHandler.field() and rethrowing the exception only if strictParsing is enabled; otherwise we log the error and carry on.
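To make the try / catch idea concrete, here is a minimal, self-contained sketch of the pattern. It does not use Tika or mime4j: FieldParseException is a hypothetical stand-in for MimeException, Handler is a hypothetical stand-in for MailContentHandler, and the date check is a placeholder rather than the real field parsing logic. Only the strict / lenient switch mirrors what is described above.

```java
import java.util.ArrayList;
import java.util.List;

public class LenientFieldSketch {

    /** Hypothetical stand-in for mime4j's MimeException. */
    static class FieldParseException extends Exception {
        FieldParseException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for MailContentHandler. */
    static class Handler {
        private final boolean strictParsing;
        final List<String> parsed = new ArrayList<>();
        final List<String> loggedErrors = new ArrayList<>();

        Handler(boolean strictParsing) {
            this.strictParsing = strictParsing;
        }

        void field(String name, String value) throws FieldParseException {
            try {
                parseField(name, value);
            } catch (FieldParseException e) {
                if (strictParsing) {
                    throw e; // strict: the whole document fails
                }
                // lenient: record the problem and carry on with the next field
                loggedErrors.add(name + ": " + e.getMessage());
            }
        }

        // Placeholder logic: a Date field without a four-digit year is "malformed".
        private void parseField(String name, String value) throws FieldParseException {
            if (name.equalsIgnoreCase("Date") && !value.matches(".*\\d{4}.*")) {
                throw new FieldParseException("malformed date '" + value + "'");
            }
            parsed.add(name);
        }
    }

    public static void main(String[] args) throws Exception {
        Handler lenient = new Handler(false);
        lenient.field("Subject", "Meeting");
        lenient.field("Date", "garbage");
        lenient.field("From", "a@example.com");
        System.out.println("parsed=" + lenient.parsed + " errors=" + lenient.loggedErrors);
    }
}
```

In the real handler the flag would presumably be read from the MimeEntityConfig passed to the parser, and the error would go to the project's logger rather than a list; both of those details are assumptions here.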
I then re-ran this on the entire corpus and it parsed successfully.