  TIKA-2933

Revisit "replacement" encoding mappings in StandardHtmlEncodingDetector.

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      I'm finally getting around to running the comparisons between our legacy HtmlEncodingDetector and the newer StandardHtmlEncodingDetector. More analysis is required, but the newer one is generally better*. One area for improvement or explanation, though, is the "replacement" encoding.

      • Text extracted with the StandardHtmlEncodingDetector contains about 1 million more "common words" than text extracted with only our legacy detector. The legacy extracts contain 133M common words, so that is an improvement of less than 1% (roughly 0.75%).
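
      For context, the WHATWG Encoding Standard maps a handful of legacy labels to a single "replacement" encoding, whose decoder emits U+FFFD instead of decoding the bytes. The sketch below only illustrates that label lookup; it is not Tika's implementation, and the class and method names (`ReplacementLabels`, `isReplacementLabel`) are hypothetical.

```java
import java.util.Set;

public class ReplacementLabels {

    // Labels that the WHATWG Encoding Standard assigns to the "replacement" encoding.
    static final Set<String> REPLACEMENT_LABELS = Set.of(
            "csiso2022kr", "hz-gb-2312", "iso-2022-cn",
            "iso-2022-cn-ext", "iso-2022-kr", "replacement");

    // Lowercase only A-Z, so non-ASCII look-alikes never fold into ASCII letters.
    static String asciiLowerCase(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
        }
        return sb.toString();
    }

    // Hypothetical helper: true if a declared charset label maps to "replacement".
    // trim() removes every char <= U+0020, slightly broader than the standard's
    // ASCII-whitespace stripping; acceptable for a sketch.
    static boolean isReplacementLabel(String label) {
        if (label == null) {
            return false;
        }
        return REPLACEMENT_LABELS.contains(asciiLowerCase(label.trim()));
    }

    public static void main(String[] args) {
        System.out.println(isReplacementLabel(" ISO-2022-KR "));
        System.out.println(isReplacementLabel("utf-8"));
    }
}
```

      A detector that follows the standard strictly would report "replacement" for pages declaring any of these labels, which is presumably where the extracted text diverges from the legacy detector's output.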

            People

              Assignee: Unassigned
              Reporter: Tim Allison (tallison)