  Tika / TIKA-2539

TagSoup HTML parser is project EOL

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.16, 1.17
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Environment: All

      Description

      The TagSoup HTML parser has reached end of life (EOL) as a project. Its last update was the 1.2.1 release, the version Tika references, in August 2011.
      I cannot find any TagSoup forks that are still active, but there are many alternative HTML parsers available, some perhaps better if you believe the reviews and the Wikipedia comparisons.
      Perhaps the most active is jsoup, which Tika already pulls in as a transitive dependency of edu.ucar:grib. It has over 1,000 usages and was updated as recently as a few months ago:
      https://mvnrepository.com/artifact/org.jsoup/jsoup
      https://jsoup.org/
      Requesting consideration of moving away from the long-EOL'd TagSoup to an active, modern HTML parser such as jsoup, which is already a transitive Tika dependency.

              People

              • Assignee: Unassigned
              • Reporter: Richard Jones
              • Votes: 0
              • Watchers: 1
