[TIKA-2485] EncodingDetectors markLimits to be configurable - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.16
Fix Version/s: 1.17
Component/s: detector
Labels: None

Description

Tim's response to my question:

-----Original message-----
> From: Allison, Timothy B. <tallison@mitre.org>
> Sent: Friday 27th October 2017 14:53
> To: user@tika.apache.org
> Subject: RE: Incorrect encoding detected
>
> Hi Markus,
>
> My guess is that the ~32,000 characters of mostly ascii-ish <script/> are what is actually being used for encoding detection. The HTMLEncodingDetector only looks in the first 8,192 characters, and the other encoding detectors have similar (but longer?) restrictions.
>
> At some point, I had a dev version of a stripper that removed contents of <script/> and <style/> before trying to detect the encoding[0]...perhaps it is time to resurrect that code and integrate it?
>
> Or, given that HTML has been, um, blossoming, perhaps, more simply, we should expand how far we look into a stream for detection?
>
> Cheers,
>
> Tim
>
> [0] https://issues.apache.org/jira/browse/TIKA-2038
>
>
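To illustrate the limit Tim describes, here is a minimal sketch of mark/reset-based sniffing using only the JDK. It is not Tika's HtmlEncodingDetector; the class name `MarkLimitSketch`, the helper `pageWithLeadingScript`, and the simplified regular expression are all hypothetical, and the synthetic page only assumes that a large script block precedes the charset declaration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a mark limit; not Tika's actual detector code.
class MarkLimitSketch {
    // Simplified pattern for <meta charset="..."> and http-equiv style declarations.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared within the first markLimit bytes, or null. */
    static String sniffCharset(InputStream in, int markLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(markLimit);
        try {
            byte[] head = new byte[markLimit];
            int total = 0;
            int n;
            // Read at most markLimit bytes; anything beyond that is never inspected.
            while (total < markLimit
                    && (n = in.read(head, total, markLimit - total)) != -1) {
                total += n;
            }
            // ISO-8859-1 maps every byte to one char, which is enough to find ASCII markup.
            String text = new String(head, 0, total, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(text);
            return m.find() ? m.group(1) : null;
        } finally {
            in.reset(); // rewind so a later parser sees the stream from the start
        }
    }

    /** Convenience overload that wraps a byte array in a mark-capable stream. */
    static String sniffCharset(byte[] page, int markLimit) {
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(page))) {
            return sniffCharset(in, markLimit);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a synthetic page whose charset declaration follows a long script block. */
    static byte[] pageWithLeadingScript(int scriptChars) {
        StringBuilder sb = new StringBuilder("<html><head><script>");
        for (int i = 0; i < scriptChars; i++) {
            sb.append('x');
        }
        sb.append("</script><meta charset=\"utf-8\"></head><body>f\u00E5</body></html>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = pageWithLeadingScript(32_000);
        System.out.println(sniffCharset(page, 8_192));   // declaration lies beyond the limit
        System.out.println(sniffCharset(page, 65_536));  // declaration is within the limit
    }
}
```

With the smaller limit the declaration is never reached, so a detector built this way would have to fall back to other heuristics. Raising the limit, which is what this issue asks to make configurable, lets the same code see the declaration.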
> -----Original Message-----
> From: Markus Jelsma markus.jelsma@openindex.io
> Sent: Friday, October 27, 2017 8:39 AM
> To: user@tika.apache.org
> Subject: Incorrect encoding detected
>
> Hello,
>
> We have a problem with Tika, encoding and pages on this website: https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
>
> Using Nutch and Tika 1.12, and also Tika 1.16, we found that the regular HTML parser does a fine job, but our TikaParser has a hard time with this HTML. For some reason Tika reports Content-Encoding=windows-1252 for this page, even though the page correctly identifies itself as UTF-8.
>
> Of all websites we index, this is so far the only one giving trouble indexing accents, getting fÃ¥ instead of a regular få.
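The symptom reported above is what results when UTF-8 bytes are decoded as windows-1252. The following is a small JDK-only sketch of that effect; the class and method names (`MojibakeDemo`, `misdecode`, `repair`) are hypothetical and are not part of Tika or Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the mis-decoding described in the report.
class MojibakeDemo {
    /** Encodes text with its actual charset, then decodes the bytes with the wrong one. */
    static String misdecode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    /** Reverses misdecode, provided every garbled char maps back to its original byte. */
    static String repair(String garbled, Charset assumed, Charset actual) {
        return new String(garbled.getBytes(assumed), actual);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // U+00E5 (a with ring) is the two bytes C3 A5 in UTF-8;
        // windows-1252 decodes those bytes as U+00C3 followed by U+00A5.
        String garbled = misdecode("f\u00E5", StandardCharsets.UTF_8, cp1252);
        System.out.println(garbled);
        System.out.println(repair(garbled, cp1252, StandardCharsets.UTF_8));
    }
}
```

The round trip in `repair` works for this input, but it is not a general fix: a few byte values have no windows-1252 mapping and are lost during the wrong decode, so detecting the correct encoding up front remains the reliable approach.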

People

Assignee: Tim Allison

Reporter: Markus Jelsma

Votes: 0

Watchers: 3

Dates

Created: 27/Oct/17 13:36

Updated: 02/Nov/17 16:21

Resolved: 02/Nov/17 13:43