Tika / TIKA-952

HTML meta tags ignored for encoding detection

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.1
    • Fix Version/s: None
    • Component/s: general
    • Labels: None

    Description

      Hello,

      I have Czech HTML files that contain a meta tag declaring the correct encoding (windows-1250), but Tika ignores it and detects ISO-8859-2, and sometimes even ISO-8859-9 (Turkish). This causes diacritics to be processed incorrectly.

      Shouldn't it rather respect what the HTML meta tag declares?

      HTML file header:
      <meta http-equiv=Content-Type content="text/html; charset=windows-1250">

      Tika detected metadata:
      Content-Encoding: ISO-8859-2
      Content-Type: text/html; charset=windows-1250
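
The expected behavior, where a charset declared in a meta tag takes precedence over statistical guessing, can be illustrated with a minimal sketch. This is not Tika's detection code: the class `MetaCharsetSniffer`, the method `declaredCharset`, and the 8192-byte prefix limit are all hypothetical and chosen only for illustration. The sketch assumes the page uses an ASCII-compatible encoding, so the tag itself is readable before the real charset is known.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: reads the charset a page declares
// in its <meta> tag so a caller can prefer it over a statistical guess.
class MetaCharsetSniffer {

    // Assumption: the declaration appears near the top of the document.
    private static final int PREFIX_LIMIT = 8192;

    // Matches both <meta http-equiv=... content="...; charset=X">
    // and <meta charset="X">, case-insensitively.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?i)<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)");

    static Optional<String> declaredCharset(byte[] html) {
        int n = Math.min(html.length, PREFIX_LIMIT);
        // ISO-8859-1 maps every byte to one char, so decoding never fails
        // and ASCII markup survives regardless of the real encoding.
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head>"
                + "<meta http-equiv=Content-Type content=\"text/html; charset=windows-1250\">"
                + "</head><body></body></html>").getBytes(StandardCharsets.US_ASCII);
        // Fall back to byte-level detection only when nothing is declared.
        System.out.println(declaredCharset(page).orElse("(no charset declared)"));
    }
}
```

A caller could pass the returned name to `java.nio.charset.Charset.forName` and fall back to content-based detection only when the result is empty or names a charset the JVM does not support.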

      I am not sure if I am reporting this correctly. I will be happy to provide more information if necessary.

      Thanks,

      Tomas

            People

              Assignee: Unassigned
              Reporter: Tomas Safarik (tssk)
              Votes: 0
              Watchers: 1
