TIKA-1193: Allow access to HtmlParser's HtmlSchema

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • Component/s: parser
    • Labels: None

      Description

      TagSoup's HTMLSchema is not well suited to HTML5, nor does it correctly handle some unusual markup, e.g. a table inside an anchor. If HtmlParser allowed access to the schema, applications could modify it on the fly to suit their needs.

      This would also mean that we don't have to wait for TIKA-985 to be committed; we could change the schema from our own applications.
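
      The idea can be illustrated with a minimal, self-contained sketch. It does not reproduce the attached patch, and every type in it (Schema, Context, resolveSchema, DEFAULT_SCHEMA) is a hypothetical stand-in rather than TagSoup's or Tika's real API. It only shows the pattern being requested: the parser looks for an application-supplied schema and falls back to a shared default when none is given.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins throughout: none of these types are TagSoup's or Tika's real classes.
class SchemaOverrideSketch {

    /** Toy schema: records which child elements each parent element may contain. */
    static class Schema {
        private final Map<String, Set<String>> allowedChildren = new HashMap<>();

        Schema allow(String parent, String child) {
            allowedChildren.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
            return this;
        }

        boolean canContain(String parent, String child) {
            Set<String> children = allowedChildren.get(parent);
            return children != null && children.contains(child);
        }
    }

    /** Toy type-keyed context through which an application passes overrides to a parser. */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Shared default, built once; it does not allow a table inside an anchor. */
    static final Schema DEFAULT_SCHEMA = new Schema().allow("body", "a").allow("body", "table");

    /** Uses the application's schema if one was supplied, otherwise the shared default. */
    static Schema resolveSchema(Context context) {
        return context.get(Schema.class, DEFAULT_SCHEMA);
    }

    public static void main(String[] args) {
        Context context = new Context();
        // No override supplied: the default schema rejects a table inside an anchor.
        System.out.println(resolveSchema(context).canContain("a", "table")); // false

        // The application supplies its own schema that tolerates the quirk.
        Schema custom = new Schema().allow("body", "a").allow("a", "table");
        context.set(Schema.class, custom);
        System.out.println(resolveSchema(context).canContain("a", "table")); // true
    }
}
```

      Keeping a shared default means applications that do not care about the schema pay no extra cost, while those that do can substitute their own per parse.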

        Attachments

        1. TIKA-1193-trunk.patch
          3 kB
          Markus Jelsma
        2. TIKA-1193-trunk.patch
          0.9 kB
          Markus Jelsma

              People

              • Assignee: Jukka Zitting (jukkaz)
              • Reporter: Markus Jelsma (markus17)
              • Votes: 0
              • Watchers: 2
