[ANY23-441] TikaEncodingDetector: guessEncoding may throw an ArrayIndexOutOfBoundsException - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3
Fix Version/s: 2.4
Component/s: encoding
Labels: None

Description

Using `TikaEncodingDetector.guessEncoding` may result in an `ArrayIndexOutOfBoundsException`.

The following snippet:

String encoding = new TikaEncodingDetector().guessEncoding(new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());

System.out.println(encoding);

results in the following exception:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
    at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
    at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
    at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
    at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
    at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)

The expected result is `ISO-8859-15`.

Note that the page contains a large amount of HTML after the closing `</html>` tag.

Replacing:

ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);

with:

ParseErrorList htmlErrors = ParseErrorList.tracking(100);

fixes the issue. I am not sure why; perhaps at some point the tracked errors are too far along in the input and the reader cannot reset far enough back.
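For callers who cannot patch the detector themselves, one possible stopgap is to guard the call and fall back to a default charset when the exception occurs. The following is only a sketch under stated assumptions: `SafeEncodingGuess`, `EncodingGuesser`, and `guessOrDefault` are hypothetical names that are not part of Any23, and the functional interface merely stands in for a call such as `guessEncoding(InputStream)` from the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class SafeEncodingGuess {

    /** Hypothetical stand-in for a detector call such as guessEncoding(InputStream). */
    @FunctionalInterface
    public interface EncodingGuesser {
        String guessEncoding(InputStream input) throws IOException;
    }

    /**
     * Runs the guesser and returns the fallback if it throws an
     * IndexOutOfBoundsException (the parent type of the
     * ArrayIndexOutOfBoundsException reported here) or returns null.
     */
    public static String guessOrDefault(EncodingGuesser guesser, InputStream input, String fallback) {
        try {
            String guessed = guesser.guessEncoding(input);
            return guessed != null ? guessed : fallback;
        } catch (IndexOutOfBoundsException e) {
            // The failure described in this issue: give up on detection for this input.
            return fallback;
        } catch (IOException e) {
            // I/O problems are a different failure mode; surface them unchanged in meaning.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulated detector that fails the same way as in the stack trace above.
        EncodingGuesser failing = in -> {
            throw new ArrayIndexOutOfBoundsException("Index -1 out of bounds for length 32768");
        };
        InputStream page = new ByteArrayInputStream(new byte[0]);
        System.out.println(guessOrDefault(failing, page, "UTF-8"));
    }
}
```

With Any23 on the classpath, the guesser could presumably be supplied as a lambda wrapping `new TikaEncodingDetector().guessEncoding(in)`. Note that a fallback only hides the symptom: for the page in this report it would yield the default rather than the expected `ISO-8859-15`.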

Issue Links

links to

GitHub Pull Request #139

People

Assignee: Hans Brende

Reporter: Anthony Pessy

Votes: 0

Watchers: 2

Dates

Created: 27/Aug/19 07:48

Updated: 21/Sep/20 02:01

Resolved: 29/Mar/20 23:15

Time Tracking

Estimated: Not Specified

Remaining:

Logged: