Tika / TIKA-341

Use charset in CONTENT_TYPE metadata when detecting the character encoding

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: None
    • Labels: None

      Description

      If no content encoding is specified and, for HTML pages, there is no explicit charset in the meta http-equiv tag, then the charset from the content-type metadata should be passed to the CharsetDetector as the "declared encoding".
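
The fallback order described above can be sketched as follows. This is a minimal illustration using only the Java standard library, not the code from the attached patch: the class name `DeclaredEncodingSketch`, both method names, and the three string parameters are hypothetical stand-ins for the content-encoding metadata, the charset found in a meta http-equiv tag, and the content-type metadata. The returned name is what a caller would presumably hand to the detector as its declared encoding (in ICU4J's `CharsetDetector` that is `setDeclaredEncoding(String)`).

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Tika or of the attached patch.
final class DeclaredEncodingSketch {

    // Matches a "; charset=value" parameter, case-insensitively, with optional quotes.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    // Returns the charset named in a Content-Type value, or null if there is
    // none or the JVM does not support it.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? name : null;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return null;
        }
    }

    // Precedence assumed from the description: explicit content encoding first,
    // then the charset from the HTML meta tag, then the Content-Type charset.
    static String declaredEncoding(String contentEncoding,
                                   String metaCharset,
                                   String contentType) {
        if (contentEncoding != null) {
            return contentEncoding;
        }
        if (metaCharset != null) {
            return metaCharset;
        }
        return charsetFromContentType(contentType);
    }
}
```

A declared encoding is only a hint to the detector, so a sketch like this returns null rather than guessing when no usable charset parameter is present.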

      Relatedly, the CharsetDetector should have input filtering turned on for HTML pages, so that tags are stripped out before detection.
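
To illustrate why tag stripping matters, here is a rough sketch of the kind of filtering meant above. It is not ICU's or Tika's actual filter (in ICU4J the behavior is switched on with `CharsetDetector.enableInputFilter(true)`); the class `TagFilterSketch` and its `stripTags` method are hypothetical, and the approach assumes an ASCII-compatible encoding in which `<` and `>` appear as single bytes.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only: a naive byte-level markup filter.
final class TagFilterSketch {

    // Drops everything between '<' and '>' and leaves a single space in place
    // of each tag, so that the text on either side does not run together.
    static byte[] stripTags(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        boolean inTag = false;
        for (byte b : input) {
            if (b == '<') {
                inTag = true;
                out.write(' ');
            } else if (b == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

Markup is almost entirely ASCII, so leaving it in dilutes the non-ASCII bytes that statistical detection relies on; removing it lets the detector see mostly body text.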

      1. TIKA-341.patch
        8 kB
        Ken Krugler


          People

          • Assignee: Jukka Zitting
          • Reporter: Ken Krugler