Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- 1.7, 1.8
- None
Description
There are several Tika issues related to how TagSoup cleans up HTML (TIKA-381, TIKA-985, and possibly TIKA-715), but TagSoup doesn't appear to be under active development.
On the other hand, I know of several projects that now use JSoup, which is actively maintained (albeit with only one main contributor) and released under the MIT license.
I haven't yet looked into how hard it would be to switch this dependency.
Attachments
Issue Links
- is duplicated by
  - TIKA-2539 TagSoup HTML parser is project EOL (Resolved)
- is related to
  - TIKA-2010 Unable to get <title> value when header is incorrect (Resolved)
  - TIKA-2928 Less than sign within tag boundaries considered as start of a new tag. (Resolved)
- relates to
  - TIKA-2562 tika server parse HTML removes DIVs around hyperlink & adds shape (Resolved)
  - TIKA-4109 Remove use of EOL component TagSoup 1.2.1 from tika-parsers-standard-package (Resolved)
- supercedes
  - TIKA-1808 Head section closed too eager (Resolved)