TIKA-2485: EncodingDetectors markLimits to be configurable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: detector
    • Labels: None

      Description

      Tim's response to my question:

      ----Original message----
      > From: Allison, Timothy B. <tallison@mitre.org>
      > Sent: Friday 27th October 2017 14:53
      > To: user@tika.apache.org
      > Subject: RE: Incorrect encoding detected
      >
      > Hi Markus,
      >
      > My guess is that the ~32,000 characters of mostly ascii-ish <script/> are what is actually being used for encoding detection. The HTMLEncodingDetector only looks in the first 8,192 characters, and the other encoding detectors have similar (but longer?) restrictions.
      >
      > At some point, I had a dev version of a stripper that removed contents of <script/> and <style/> before trying to detect the encoding[0]...perhaps it is time to resurrect that code and integrate it?
      >
      > Or, given that HTML has been, um, blossoming, perhaps, more simply, we should expand how far we look into a stream for detection?
      >
      > Cheers,
      >
      > Tim
      >
      > [0] https://issues.apache.org/jira/browse/TIKA-2038
      >
      >
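The effect Tim describes can be sketched with plain JDK classes. This is a minimal illustration, not Tika's implementation: the class `MarkLimitSketch`, the method `detectMetaCharset`, the regular expression, and the synthetic page are all hypothetical. It reads at most `markLimit` bytes from a stream, looks for a `<meta ... charset=...>` declaration in that window, and resets the stream. With roughly 32,000 bytes of script in front of the declaration, a limit of 8,192 finds nothing, while a larger limit does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from Tika.
class MarkLimitSketch {

    // Simplified pattern for <meta charset="..."> and http-equiv style declarations.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset found within the first markLimit bytes, or null.
    static String detectMetaCharset(InputStream in, int markLimit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(markLimit);
            byte[] head = new byte[markLimit];
            int total = 0;
            while (total < markLimit) {
                int n = in.read(head, total, markLimit - total);
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
            in.reset(); // leave the stream untouched for the actual parser
            // ISO-8859-1 maps each byte to one char, which is enough to find ASCII markup.
            String text = new String(head, 0, total, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(text);
            return m.find() ? m.group(1) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a synthetic page with ~32,000 bytes of script before the meta tag.
    static byte[] buildPage() {
        StringBuilder sb = new StringBuilder("<html><head><script>");
        for (int i = 0; i < 32000; i++) {
            sb.append('x');
        }
        sb.append("</script><meta charset=\"utf-8\"></head></html>");
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(buildPage());
        System.out.println(detectMetaCharset(in, 8192));  // null: declaration lies beyond the window
        System.out.println(detectMetaCharset(in, 65536)); // utf-8
    }
}
```

Passing the limit as a parameter is the point of the sketch: the same detection logic gives different answers depending only on how far into the stream it is allowed to look.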
      > ----Original Message----
      > From: Markus Jelsma <markus.jelsma@openindex.io>
      > Sent: Friday, October 27, 2017 8:39 AM
      > To: user@tika.apache.org
      > Subject: Incorrect encoding detected
      >
      > Hello,
      >
      > We have a problem with Tika, encoding and pages on this website: https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
      >
      > Using Nutch and Tika 1.12, but also using Tika 1.16, we found that the regular HTML parser does a fine job, but our TikaParser has a tough time dealing with this HTML. For some reason Tika concludes that the page declares Content-Encoding=windows-1252, whereas the page properly identifies itself as UTF-8.
      >
      > Of all the websites we index, this is so far the only one that gives us trouble with accented characters: we get fÃ¥ instead of a regular få.
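The garbled output reported above is what results when UTF-8 bytes are decoded as windows-1252. A small sketch using only the JDK reproduces it; the class `MojibakeSketch` and the method `misdecode` are hypothetical names for illustration. Unicode escapes are used so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: reproduces the symptom, not Tika's detection logic.
class MojibakeSketch {

    // Encodes text as UTF-8, then decodes those bytes with the wrong charset.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+00E5 (a with ring) is the two bytes C3 A5 in UTF-8; windows-1252
        // maps C3 to U+00C3 and A5 to U+00A5, so one character becomes two.
        String garbled = misdecode("f\u00e5");
        System.out.println(garbled.equals("f\u00c3\u00a5")); // true
    }
}
```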

            People

            • Assignee: Tim Allison (tallison@mitre.org)
            • Reporter: Markus Jelsma (markus17)