Tika / TIKA-341

Use charset in CONTENT_TYPE metadata when detecting the character encoding


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

    Description

      If no content encoding is specified, and (for HTML pages) there's no explicit charset in the meta http-equiv tag, then the charset in the content-type metadata should be used as the "declared encoding" for the CharsetDetector.

      Related to this is that the CharsetDetector should have filtering turned on for HTML pages, so that tags get stripped out.
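
      The fallback order described above can be sketched roughly as follows. This is only an illustration using the JDK, not the code from the attached patch: the class name DeclaredEncodingSketch, both method names, and the regular expression are hypothetical, and the sketch assumes the Content-Type value is a plain string such as "text/html; charset=ISO-8859-1".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: picks the "declared encoding" hint
// in the order the description suggests.
class DeclaredEncodingSketch {

    // Matches a charset parameter, e.g. text/html; charset="UTF-8"
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i)[;\\s]charset\\s*=\\s*\"?([^\\s;\"]+)");

    // Returns the charset parameter of a Content-Type value, or null if absent.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // 1. explicit content encoding, 2. charset from the HTML meta http-equiv tag,
    // 3. charset parameter of the Content-Type metadata; null if none is available.
    static String pickDeclaredEncoding(String contentEncoding,
                                       String metaCharset,
                                       String contentType) {
        if (contentEncoding != null && !contentEncoding.isEmpty()) {
            return contentEncoding;
        }
        if (metaCharset != null && !metaCharset.isEmpty()) {
            return metaCharset;
        }
        return charsetFromContentType(contentType);
    }

    public static void main(String[] args) {
        // Neither an explicit encoding nor a meta charset: fall back to Content-Type.
        System.out.println(
                pickDeclaredEncoding(null, null, "text/html; charset=ISO-8859-1"));
    }
}
```

      A non-null result would then presumably be handed to the detector as its declared encoding (for example via setDeclaredEncoding), with input filtering enabled for HTML (for example via enableInputFilter(true)). Both calls exist on ICU4J's CharsetDetector; that Tika's own CharsetDetector exposes them the same way is an assumption here.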

      Attachments

        1. TIKA-341.patch
          8 kB
          Kenneth William Krugler

            People

              Assignee: Jukka Zitting (jukkaz)
              Reporter: Kenneth William Krugler (kkrugler)
