Description
Good morning,
Crawling legacy sites that contain malformed HTML fragments causes severe Solr XML parse errors, which in turn cause ManifoldCF to abort.
Can we add <div> to the list of heuristics so that the HTML parser is used instead of the XML parser?
See TIKA-1101 for further information.
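To illustrate the kind of heuristic requested above, here is a minimal sketch in Java. It is not the actual ManifoldCF or Tika code: the class name `HtmlSniffer`, the marker list, and the 4096-character sniff limit are all assumptions made for illustration. The idea is simply that a fragment containing `<div` would be treated as HTML and routed to the HTML parser, so that entities such as `&nbsp;` do not trip the XML parser.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of ManifoldCF, Solr, or Tika. */
final class HtmlSniffer {
    // Assumed marker list; the real heuristic list may look different.
    // "<div" is the entry this ticket asks to add.
    private static final List<String> HTML_MARKERS =
            List.of("<!doctype html", "<html", "<body", "<div");

    // Only the start of the content is inspected, to keep the check cheap (assumed limit).
    private static final int SNIFF_LIMIT = 4096;

    /** Returns true if the start of the content contains any known HTML marker. */
    static boolean looksLikeHtml(String content) {
        if (content == null) {
            return false;
        }
        // Lower-case the inspected prefix so that <DIV> and <div> match alike.
        String head = content
                .substring(0, Math.min(content.length(), SNIFF_LIMIT))
                .toLowerCase(Locale.ROOT);
        for (String marker : HTML_MARKERS) {
            if (head.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A legacy fragment with no <html> or <body> wrapper, only a <div>.
        System.out.println(looksLikeHtml("<DIV>legacy&nbsp;fragment</DIV>"));
        // A plain XML document with none of the markers.
        System.out.println(looksLikeHtml("<?xml version=\"1.0\"?><doc/>"));
    }
}
```

A plain substring match like this is deliberately crude (for example, it would also match a tag name that merely starts with `div`); a real heuristic would probably want to check for a tag boundary as well.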
Thank you,
Issue Links
- relates to: TIKA-1101 XML parse error caused by org.xml.sax.SAXParseException; The entity "nbsp" was referenced, but not declared (Closed)