  1. Tika
  2. TIKA-952

HTML meta tags ignored for encoding detection

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.1
    • Fix Version/s: None
    • Component/s: general
    • Labels:
      None

      Description

      Hello,

      I have Czech HTML files that contain a meta tag declaring the correct encoding (windows-1250), but Tika ignores it and detects ISO-8859-2, and sometimes even ISO-8859-9 (Turkish). This causes diacritics to be processed incorrectly.

      Shouldn't it rather respect what HTML meta tag declares?
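
To make the diacritics problem described above concrete, here is a minimal sketch using only the JDK charset API. It is not Tika code: the class and method names (`DiacriticsDemo`, `misdecode`) are hypothetical, and it assumes the JDK in use ships both the windows-1250 and ISO-8859-2 charsets. It encodes Czech text as windows-1250 and decodes the same bytes with a different charset, which is the effect of a wrong detection.

```java
import java.nio.charset.Charset;

// Hypothetical illustration class, not part of Tika.
final class DiacriticsDemo {

    /** Encodes text as windows-1250, then decodes those bytes with another charset. */
    static String misdecode(String text, String wrongCharset) {
        byte[] bytes = text.getBytes(Charset.forName("windows-1250"));
        return new String(bytes, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        // \u0161 = s with caron, \u017e = z with caron, \u00e1 = a with acute.
        String original = "\u0161\u017e\u00e1";
        String garbled = misdecode(original, "ISO-8859-2");

        // Letters whose byte positions differ between the two charsets are lost;
        // letters at shared positions (such as a with acute) survive.
        for (int i = 0; i < original.length(); i++) {
            System.out.printf("U+%04X -> U+%04X%n",
                    (int) original.charAt(i), (int) garbled.charAt(i));
        }
    }
}
```

In windows-1250 the letters with caron used here sit in the 0x80 to 0x9F byte range, which ISO-8859-2 reserves for control characters, so those letters come out as unprintable characters rather than as Czech letters.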

      HTML file header:
      <meta http-equiv=Content-Type content="text/html; charset=windows-1250">

      Tika detected metadata:
      Content-Encoding: ISO-8859-2
      Content-Type: text/html; charset=windows-1250
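
As a possible workaround on the caller's side, the declared charset can be read from the meta tag before the document is decoded. The following is a rough sketch under stated assumptions, not Tika's own detection logic: the class `MetaCharsetSniffer` and its method are hypothetical names, the regular expression is a simplification of real HTML parsing, and the caller is assumed to pass in the first bytes of the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika.
final class MetaCharsetSniffer {

    // Finds charset=... inside a <meta ...> tag, quoted or unquoted, in any letter case.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    static String declaredCharset(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // can be scanned safely before the real encoding is known.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

For the header shown above, `declaredCharset` would return `windows-1250`, which the caller could then prefer over a statistically detected value.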

      I am not sure whether I am reporting this correctly. I will be happy to provide more information if necessary.

      Thanks,

      Tomas

              People

              • Assignee: Unassigned
              • Reporter: tssk Tomas Safarik