TIKA-657

Email parser gets into trouble on malformed HTML in the Enron corpus

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      There is a very large corpus of email messages available: http://www.cs.cmu.edu/~enron/.

      When processing even a subset of this corpus, I see numerous 'unexpected RuntimeException' errors caused by TagSoup throwing on truly awful HTML. It seems to me that being able to get through the entire corpus would make a good '1.0' criterion for Tika's email parser.
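
One way to triage failures like these is to run the parser over a batch of files and count the errors by exception type and message. The following is a minimal sketch of such a harness, not Tika code: `tally_failures` and `parse_fn` are hypothetical names, and `parse_fn` stands in for whatever call invokes the email parser in your setup.

```python
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable


def tally_failures(paths: Iterable[Path],
                   parse_fn: Callable[[bytes], object]) -> Counter:
    """Run parse_fn over each file and count failures.

    parse_fn is a hypothetical stand-in for the real parser call.
    Keys of the returned Counter are (exception type name, message).
    """
    failures = Counter()
    for path in paths:
        try:
            parse_fn(path.read_bytes())
        except Exception as exc:
            # Deliberately broad: the goal is a summary of what goes
            # wrong across the corpus, not recovery from any one error.
            failures[(type(exc).__name__, str(exc))] += 1
    return failures
```

Calling `tally_failures(sorted(Path("maildir").rglob("*")), parse_fn)` on a directory that contains only message files, followed by `most_common()` on the result, would list the most frequent failure signatures first. The `maildir` path is a placeholder.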

        Activity

        Benson Margulies created the issue.
        Julien Nioche made changes:
          Assignee set to Julien Nioche [ jnioche ]
        Mark Butler made changes:
          Attachment added: tika0.9-enron-errors-summary.txt [ 12480638 ]
        Mark Butler made changes.
        Jukka Zitting made changes:
          Status changed from Open [ 1 ] to Resolved [ 5 ]
          Fix Version/s set to 1.0 [ 12317967 ]
          Resolution set to Fixed [ 1 ]

          People

          • Assignee: Julien Nioche
          • Reporter: Benson Margulies
          • Votes: 0
          • Watchers: 0
