Tika / TIKA-657

Email parser gets into trouble on malformed html in enron corpus

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels: None

      Description

      The Enron corpus, a very large collection of email messages, is available at http://www.cs.cmu.edu/~enron/

      In processing even a subset of this corpus, I see numerous 'unexpected RuntimeException' errors caused by TagSoup throwing on truly awful HTML. It seems to me that getting through this entire corpus without such failures would make a good '1.0' criterion for Tika's email parser.
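For anyone wanting to reproduce this kind of survey, here is a minimal sketch of a sweep over a directory of raw messages that tallies failures by exception type. It is not Tika code: `sweep_corpus` and the `parse_one` callable are hypothetical names, and `parse_one` stands in for whatever actually invokes the email parser on one message.

```python
from collections import Counter
from pathlib import Path
from typing import Callable, List, Tuple


def sweep_corpus(
    root: Path, parse_one: Callable[[bytes], object]
) -> Tuple[int, Counter, List[Tuple[Path, str]]]:
    """Run parse_one over every file under root and tally failures.

    parse_one is a hypothetical stand-in for the call that hands one raw
    message to the email parser; it is an assumption, not a Tika API.
    """
    parsed = 0
    failures_by_type: Counter = Counter()
    failed_files: List[Tuple[Path, str]] = []
    # Sort so repeated runs visit the messages in the same order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        try:
            parse_one(path.read_bytes())
            parsed += 1
        except Exception as exc:
            # Record the failure and keep going, so a single malformed
            # message cannot abort the whole sweep.
            failures_by_type[type(exc).__name__] += 1
            failed_files.append((path, str(exc)))
    return parsed, failures_by_type, failed_files
```

The returned list of failing paths gives a set of messages that could be turned into regression tests once the underlying parse errors are handled.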

            People

            • Assignee: Julien Nioche (jnioche)
            • Reporter: Benson Margulies (bmargulies)