TIKA-1193: Allow access to HtmlParser's HtmlSchema

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 1.4
    • Fix Version: 1.5
    • Component: parser
    • Labels: None

    Description

      TagSoup's HTMLSchema is not well suited to HTML5, nor can it correctly handle some unusual constructs, e.g. a table inside an anchor. If HtmlParser allowed access to the schema, applications could modify it on the fly to suit their needs.

      This would also mean that we don't have to rely on TIKA-985 being committed; we can change the schema from our own applications.
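
The idea in the description can be sketched as follows. This is a minimal, self-contained illustration only: `ToySchema`, `ToyHtmlParser`, `allow`, `getSchema`, and `accepts` are hypothetical stand-ins, not TagSoup or Tika classes, and the sketch does not claim to show how the attached patch exposes the schema. It only shows why an accessible schema lets an application permit a table inside an anchor without changing the parser itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration only; NOT TagSoup or Tika classes.
class SchemaAccessSketch {

    /** Toy schema: records which child elements each parent element may contain. */
    static class ToySchema {
        private final Map<String, Set<String>> allowedChildren = new HashMap<>();

        /** Permit {@code child} as a direct child of {@code parent}. */
        void allow(String parent, String child) {
            allowedChildren.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
        }

        boolean canContain(String parent, String child) {
            return allowedChildren
                    .getOrDefault(parent, Collections.<String>emptySet())
                    .contains(child);
        }
    }

    /** Toy parser that exposes its schema instead of keeping it private. */
    static class ToyHtmlParser {
        private final ToySchema schema = new ToySchema();

        ToyHtmlParser() {
            // Assumed defaults for the sketch: tables are allowed in body, not in anchors.
            schema.allow("a", "span");
            schema.allow("body", "table");
        }

        /** Accessor in the spirit of this issue: callers may adjust the schema. */
        ToySchema getSchema() {
            return schema;
        }

        boolean accepts(String parent, String child) {
            return schema.canContain(parent, child);
        }
    }

    public static void main(String[] args) {
        ToyHtmlParser parser = new ToyHtmlParser();
        System.out.println(parser.accepts("a", "table")); // prints false

        // The application changes the schema on the fly, without patching the parser.
        parser.getSchema().allow("a", "table");
        System.out.println(parser.accepts("a", "table")); // prints true
    }
}
```

The point of the design is that the rule change lives in application code, so it does not depend on an upstream schema fix being committed first.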

      Attachments

        1. TIKA-1193-trunk.patch
          0.9 kB
          Markus Jelsma
        2. TIKA-1193-trunk.patch
          3 kB
          Markus Jelsma

            People

              Assignee: Jukka Zitting
              Reporter: Markus Jelsma