Apache Any23 (Retired) / ANY23-418

Take another look at encoding detection

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3
    • Component/s: encoding
    • Labels: None

    Description

      To address various shortcomings of Tika's encoding detection, I've had to modify the TikaEncodingDetector several times (cf. ANY23-385 and ANY23-411). In the former, I placed a much greater weight on charsets declared in HTML meta elements and XML declarations. In the latter, I placed a much greater weight on charsets returned in HTTP Content-Type headers.

      However, after taking a look at TIKA-539, I'm thinking I should reduce this added weight (at least for HTML meta elements), and perhaps ignore such declarations altogether unless they declare UTF-8. Incorrect declarations usually seem to name something other than UTF-8 when the correct charset is UTF-8.

      Something like 90% or more of all webpages use UTF-8, and all of our encoding detection errors to date have involved something other than UTF-8 being detected when the correct encoding was actually UTF-8, not the other way around.

      Therefore, what I propose is the following:

      (1) In the absence of a Content-Type header, any declared hints that the charset is UTF-8 should add to the weight for UTF-8, while any declared hints that the charset is not UTF-8 should be ignored.

      (2) In the presence of a Content-Type header, any other declared hints should be ignored, unless they match UTF-8 and do not match the Content-Type header, in which case all hints, including the Content-Type header, should be ignored.

      EDIT: The two points above are a simplification of what I actually implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See PR 131 for details.
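
      For illustration only, the two simplified rules (1) and (2) could be sketched roughly as follows. This is not the code from PR 131 and does not use Tika's API: the class HintFilter and the method effectiveHints are hypothetical names, and modelling "weight" as a plain list of hints that still count as evidence is an assumption made to keep the example small.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Any23 or Tika.
final class HintFilter {

    /**
     * Returns the charset hints that should still count as evidence.
     *
     * @param contentType charset from the HTTP Content-Type header,
     *                    or null if the header carried no charset
     * @param declared    charsets declared in the document itself
     *                    (HTML meta elements, XML declarations)
     */
    static List<Charset> effectiveHints(Charset contentType, List<Charset> declared) {
        List<Charset> kept = new ArrayList<>();
        Charset utf8 = StandardCharsets.UTF_8;

        if (contentType == null) {
            // Rule (1): without a Content-Type charset, only UTF-8
            // declarations add weight; all others are ignored.
            for (Charset c : declared) {
                if (utf8.equals(c)) {
                    kept.add(c);
                }
            }
            return kept;
        }

        // Rule (2): with a Content-Type charset, declared hints are ignored,
        // unless one of them says UTF-8 while the header says something else.
        // In that conflicting case every hint, header included, is dropped.
        boolean conflict = !utf8.equals(contentType) && declared.contains(utf8);
        if (!conflict) {
            kept.add(contentType);
        }
        return kept;
    }
}
```

      An empty result means that no declaration is trusted and detection would have to rely on the byte content alone.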

            People

              Assignee: Hans Brende (hansbrende)
              Reporter: Hans Brende (hansbrende)