Tika / TIKA-2050

HTMLEncodingDetector class fails on some HTML documents

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      While tallison@mitre.org and I were working on TIKA-2038, I found that the HTMLEncodingDetector class cannot extract the charset from some HTML documents. I have attached the HTML documents on which it fails. It seems that its regex should be corrected to cover these cases.
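
      To illustrate the kind of regex-based charset extraction under discussion, here is a minimal sketch. It is not Tika's actual implementation: the class name MetaCharsetSketch, the method findCharset, and the pattern itself are hypothetical. The attached failing documents are not reproduced here, so the markup variants mentioned in the comments (single quotes, upper-case attributes, spaces around "=", reversed attribute order) are only assumed examples of what a narrow regex might miss.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT Tika's HTMLEncodingDetector: a tolerant regex
// for pulling a charset label out of an HTML <meta> tag.
final class MetaCharsetSketch {

    // (?is): case-insensitive, and '.' may span lines (tags can wrap).
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">,
    // regardless of attribute order, quote style, or spaces around '='.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([\\w.:-]+)");

    // Returns the declared charset label, or null if none is found.
    static String findCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<META HTTP-EQUIV='Content-Type' "
                + "CONTENT='text/html; charset=windows-1256'>"
                + "</head></html>";
        System.out.println(findCharset(html));
    }
}
```

      A real detector would normally read only the first few kilobytes of the stream and validate the label (for example with java.nio.charset.Charset.isSupported) before trusting it; the sketch skips both steps.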

            People

              Assignee: Unassigned
              Reporter: Shabanali Faghani (faghani)
              Votes: 0
              Watchers: 3
