Tika / TIKA-333

Improve accuracy of charset detection for HTML pages

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 0.5
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Charset detection for HTML pages is currently unreliable, because the beginning of a document is mostly HTML markup and gives the detector little actual text to analyze.

      A simple solution would be to skip the first 2 KB of the document (assuming the document is long enough) before passing bytes to ICU4J.
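
      A minimal sketch of the skip-ahead idea, assuming the whole document is already in memory as a byte array. The class name MarkupSkipper, the method sampleAfterPrefix, and the minSample/maxSample parameters are hypothetical and are not part of Tika or ICU4J; only the 2K figure comes from the description above.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Tika or ICU4J. */
final class MarkupSkipper {
    /** Leading bytes to skip (the "first 2K" from the issue description). */
    static final int SKIP_BYTES = 2048;

    /**
     * Returns the bytes to hand to a charset detector.
     *
     * The prefix is skipped only when at least minSample bytes remain after it
     * (an assumed reading of "long enough"); otherwise sampling starts at 0.
     * At most maxSample bytes are returned.
     */
    static byte[] sampleAfterPrefix(byte[] document, int minSample, int maxSample) {
        boolean longEnough = document.length - SKIP_BYTES >= minSample;
        int start = longEnough ? SKIP_BYTES : 0;
        int end = Math.min(document.length, start + maxSample);
        return Arrays.copyOfRange(document, start, end);
    }
}
```

      The returned sample could then be given to ICU4J, for example through CharsetDetector.setText(byte[]) followed by detect(). Note that skipping a fixed number of bytes may start the sample in the middle of a multi-byte character, which a real implementation would need to tolerate.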

      A more complex solution would be to scan for the title and body tags and pass only the bytes found inside each to the detector.
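
      The tag-scanning idea could look roughly like the sketch below. It assumes an ASCII-compatible encoding (so it would not work for UTF-16), matches tag names naively, and does not handle a '>' inside an attribute value. The class TagContentSampler and its methods are hypothetical names, not Tika APIs.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Tika or ICU4J. */
final class TagContentSampler {

    /** Collects the raw bytes inside the first title and body elements. */
    static byte[] sample(byte[] html) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        appendElement(html, "title", out);
        appendElement(html, "body", out);
        return out.toByteArray();
    }

    private static void appendElement(byte[] html, String name, ByteArrayOutputStream out) {
        int open = indexOfIgnoreCase(html, "<" + name, 0);
        if (open < 0) return;
        // Find the '>' that ends the opening tag (attributes are skipped).
        int contentStart = open;
        while (contentStart < html.length && html[contentStart] != '>') contentStart++;
        if (contentStart == html.length) return;
        contentStart++;
        // If the closing tag is missing, take everything up to the end.
        int close = indexOfIgnoreCase(html, "</" + name, contentStart);
        int contentEnd = close >= 0 ? close : html.length;
        out.write(html, contentStart, contentEnd - contentStart);
        out.write(' '); // separator between the two samples
    }

    /** Byte-level search; the needle must be lower-case ASCII. */
    private static int indexOfIgnoreCase(byte[] hay, String needle, int from) {
        byte[] n = needle.getBytes(StandardCharsets.US_ASCII);
        for (int i = from; i <= hay.length - n.length; i++) {
            int j = 0;
            while (j < n.length && toLower(hay[i + j]) == n[j]) j++;
            if (j == n.length) return i;
        }
        return -1;
    }

    private static byte toLower(byte b) {
        return (b >= 'A' && b <= 'Z') ? (byte) (b + 32) : b;
    }
}
```

      The body sample still contains any markup nested inside the body element; stripping that as well would be a further refinement beyond what the description proposes.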

            People

              Assignee: Unassigned
              Reporter: Kenneth William Krugler (kkrugler)
              Votes: 0
              Watchers: 0