TIKA-334

HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

    Description

      Currently, if no charset is passed in via metadata, HtmlParser simply calls TagSoup to parse the document without specifying a charset.

      In that case TagSoup falls back to the platform encoding, which will often be wrong.

      The right thing to do is to first check for a charset specified by a meta http-equiv tag. If there is none, create a CharsetDetector. If the incoming metadata contains a charset, pass it to the detector via setDeclaredEncoding().
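
The order of checks described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached patch: the class name HtmlCharsetResolver, its method names, and the regular expression are hypothetical, and the detector is passed in as a plain function so that the sketch runs on the JDK alone. In a real parser that function would presumably wrap a CharsetDetector, calling setDeclaredEncoding() with the metadata charset (if any) before detecting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Tika, names chosen for illustration only.
final class HtmlCharsetResolver {

    // Loose match for charset=... inside a <meta ...> tag. A real parser
    // would inspect the http-equiv and content attributes more carefully.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s+[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_:.-]+)");

    // Step 1: look for a charset declared in a meta tag near the start
    // of the document.
    static Optional<Charset> charsetFromMeta(byte[] prefix) {
        // ISO-8859-1 maps every byte to a char, so the ASCII markup of
        // the prefix survives decoding whatever the real encoding is.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as "not declared".
            return Optional.empty();
        }
    }

    // Step 2: if no meta charset is found, ask the detector. The charset
    // from the incoming metadata (may be null) is passed along as a hint,
    // which is the role setDeclaredEncoding() plays in the description.
    static Charset resolve(byte[] prefix, String metadataCharset,
                           BiFunction<byte[], String, Charset> detector) {
        return charsetFromMeta(prefix)
                .orElseGet(() -> detector.apply(prefix, metadataCharset));
    }
}
```

The detector is only consulted when the meta tag is absent or names an unusable charset, so a document that declares its own encoding never falls through to the platform default.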

      Attachments

        1. TIKA-334.patch
          5 kB
          Kenneth William Krugler

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Kenneth William Krugler (kkrugler)