TIKA-1514

http-equiv content-type extraction should pick first parseable content value

Details

    • Type: Bug
    • Status: Closed
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 1.6
    • Fix Version/s: 1.8
    • Component/s: None
    • Labels: None

    Description

      In a handful of files from govdocs1, there are some creative http-equiv content-type headers, including:

      <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" name="keywords" content="DNRC, division of nutrition">
      

      The content type that is going into the metadata for this file is "DNRC, division of nutrition".

      Let's modify our HTML meta header charset detector to pick the first parseable charset value.
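
      The proposal can be illustrated with a minimal sketch. This is not Tika's actual detector; the class name `MetaContentPicker`, its methods, and the loose regular expressions are all hypothetical and chosen only for illustration. It collects every `content` attribute in a meta tag and returns the first one that parses as a media type, and likewise the first charset the JVM recognizes.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Tika: illustrates "pick the first parseable value".
final class MetaContentPicker {

    // Every content="..." or content='...' attribute inside one tag.
    private static final Pattern CONTENT_ATTR = Pattern.compile(
            "\\bcontent\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    // Deliberately loose media type check: type/subtype plus optional parameters.
    private static final Pattern MEDIA_TYPE = Pattern.compile(
            "^\\s*[\\w.+-]+/[\\w.+-]+\\s*(;.*)?$");

    // charset=NAME parameter, optionally quoted.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns all content attribute values in document order.
    static List<String> contentValues(String tag) {
        List<String> values = new ArrayList<>();
        Matcher m = CONTENT_ATTR.matcher(tag);
        while (m.find()) {
            values.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return values;
    }

    // First content value that looks like a media type, skipping junk such as keywords.
    static Optional<String> firstParseableContentType(String tag) {
        for (String value : contentValues(tag)) {
            if (MEDIA_TYPE.matcher(value).matches()) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    // First charset parameter naming a charset this JVM supports.
    static Optional<Charset> firstParseableCharset(String tag) {
        for (String value : contentValues(tag)) {
            Matcher m = CHARSET.matcher(value);
            while (m.find()) {
                try {
                    if (Charset.isSupported(m.group(1))) {
                        return Optional.of(Charset.forName(m.group(1)));
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: keep looking at later candidates.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"content-type\" "
                + "content=\"text/html; charset=iso-8859-1\" "
                + "name=\"keywords\" content=\"DNRC, division of nutrition\">";
        System.out.println(firstParseableContentType(tag).orElse("none"));
        System.out.println(firstParseableCharset(tag).map(Charset::name).orElse("none"));
    }
}
```

      For the tag quoted above, this sketch selects "text/html; charset=iso-8859-1" rather than the keywords string, and it behaves the same way if the two `content` attributes appear in the opposite order. A real fix would need to work on parsed attributes rather than on regular expressions over raw markup.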

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)