  1. Tika
  2. TIKA-1102

Can we add <div> to the list of heuristics for bad html fragments?

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2, 1.3
    • Fix Version/s: 1.4
    • Component/s: parser
    • Labels: None
    • Environment: I'm using Solr 4.0 final with Tika v1.2 and ManifoldCF v1.2dev, all on Tomcat 7.0.37

    Description

      Good morning,
      Crawling legacy sites that contain poorly written HTML fragments causes severe XML parse errors in Solr, which in turn cause ManifoldCF to abort.
      Can we add <div> to the list of heuristics so that the HTML parser is used for these fragments instead of the XML parser?
      See TIKA-1101 for further information.

      Thank you,
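
To illustrate the request, here is a minimal sketch of a tag-based heuristic that treats content starting with <div> as HTML. It is not Tika's detection code: the class name `HtmlFragmentSniffer`, the hint list, the 64-byte prefix length, and the `chooseParser` helper are all hypothetical and exist only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is NOT how Tika implements detection.
final class HtmlFragmentSniffer {

    // Assumed list of opening tags that mark content as HTML.
    // "<div" is the entry this ticket asks for.
    private static final List<String> HTML_HINTS =
            Arrays.asList("<!doctype html", "<html", "<head", "<body", "<title", "<div");

    // Assumed number of leading bytes to inspect.
    private static final int PREFIX_LENGTH = 64;

    // Returns true when the content starts with one of the known HTML tags.
    static boolean looksLikeHtml(byte[] content) {
        int length = Math.min(content.length, PREFIX_LENGTH);
        String prefix = new String(content, 0, length, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        for (String hint : HTML_HINTS) {
            if (matchesTag(prefix, hint)) {
                return true;
            }
        }
        return false;
    }

    // A hint matches only when it is followed by the end of the prefix,
    // whitespace, '>' or '/', so that an XML element such as <divider/>
    // is not mistaken for <div>.
    private static boolean matchesTag(String prefix, String hint) {
        if (!prefix.startsWith(hint)) {
            return false;
        }
        if (prefix.length() == hint.length()) {
            return true;
        }
        char next = prefix.charAt(hint.length());
        return next == '>' || next == '/' || Character.isWhitespace(next);
    }

    // Hypothetical dispatch: name the parser that should handle the content.
    static String chooseParser(byte[] content) {
        return looksLikeHtml(content) ? "html" : "xml";
    }

    public static void main(String[] args) {
        byte[] fragment = "<div class=\"nav\"><a href=\"/\">Home</a>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(chooseParser(fragment));
    }
}
```

With "<div" in the hint list, a bare fragment such as the one in `main` is routed to the HTML parser. Without it, the same fragment would fall through to the XML branch in this sketch, which is the failure mode described above.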

            People

              Assignee: Kenneth William Krugler (kkrugler)
              Reporter: David Morana (dmorana)
              Votes: 0
              Watchers: 3
