Tika / TIKA-1519

Don't allow whatever is in http-equiv Content-Type to overwrite actual Content-Type in HtmlParser

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 1.6
    • Fix Version/s: 1.8
    • Component/s: None
    • Labels: None

    Description

      The HtmlParser overwrites the value of Content-Type in Metadata with whatever appears in the content attribute of an http-equiv=Content-Type meta tag, e.g.

      <meta http-equiv=Content-Type content="blah de blah blah">

      or, even worse, perhaps:

      <meta http-equiv=Content-Type content="application/pdf">

      Let's capture the content type alleged by the html file in a different key from Content-Type; I'd prefer to reserve Content-Type for "text/html; charset=X".

      Candidate key/Property: Content-Type-Meta-HTTP-Equiv?

      See TIKA-1514 for example output.
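
      To make the proposal concrete, here is a minimal, self-contained sketch of the intended behavior. It is not Tika's HtmlParser and does not use Tika's Metadata class: the class name HttpEquivSketch, the collect method, the plain Map standing in for Metadata, and the simplified regex are all assumptions made for illustration. The key name is the candidate suggested above. The point is only that the value claimed by the meta tag lands under its own key while Content-Type stays untouched.

      ```java
      import java.util.LinkedHashMap;
      import java.util.Map;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class HttpEquivSketch {
          // Candidate key from this issue; hypothetical, not a confirmed Tika property.
          static final String META_KEY = "Content-Type-Meta-HTTP-Equiv";

          // Simplified matcher (an assumption for this sketch): it only handles
          // http-equiv appearing before content, with a double-quoted content value.
          private static final Pattern META = Pattern.compile(
              "<meta\\s+http-equiv\\s*=\\s*[\"']?Content-Type[\"']?\\s+content\\s*=\\s*\"([^\"]*)\"",
              Pattern.CASE_INSENSITIVE);

          // detectedType is the content type established outside the document
          // (e.g. by detection); the map stands in for a metadata container.
          static Map<String, String> collect(String detectedType, String html) {
              Map<String, String> metadata = new LinkedHashMap<>();
              metadata.put("Content-Type", detectedType);
              Matcher m = META.matcher(html);
              if (m.find()) {
                  // Record what the HTML claims under a separate key
                  // instead of overwriting Content-Type.
                  metadata.put(META_KEY, m.group(1));
              }
              return metadata;
          }

          public static void main(String[] args) {
              String html = "<html><head>"
                  + "<meta http-equiv=Content-Type content=\"application/pdf\">"
                  + "</head><body></body></html>";
              System.out.println(collect("text/html; charset=UTF-8", html));
          }
      }
      ```

      With this separation, a bogus or misleading value such as "application/pdf" remains available to callers who want to inspect it, but it can no longer replace the real content type.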

      Attachments

        1. TIKA-1519.patch
          7 kB
          Tim Allison

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)