TIKA-1599: Switch from TagSoup to JSoup

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7, 1.8
    • Fix Version/s: 3.0.0-BETA
    • Component/s: parser
    • Labels: None

    Description

      There are several Tika issues related to how TagSoup cleans up HTML (TIKA-381, TIKA-985, and possibly TIKA-715), but TagSoup doesn't seem to be under active development.

      On the other hand, I know of several projects that now use JSoup, which is an actively maintained project (albeit with only one main contributor) released under the MIT license.

      I haven't yet looked into how hard it would be to switch this dependency.

      Attachments

        1. TIKA-1599-crazy-files.tar.gz
          115 kB
          Markus Jelsma
        2. tagsoup_vs_jsoup_reports.zip
          1.20 MB
          Tim Allison
        3. consumentenbond.html
          134 kB
          Markus Jelsma

            People

              Assignee: Kenneth William Krugler (kkrugler)
              Reporter: Kenneth William Krugler (kkrugler)
              Votes: 2
              Watchers: 11
