Description
Using `TikaEncodingDetector.guessEncoding` may result in an `ArrayIndexOutOfBoundsException`.
The following snippet:
String encoding = new TikaEncodingDetector().guessEncoding(new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
System.out.println(encoding);
results in the following exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
    at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
    at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
    at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
    at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
    at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
    at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
The expected result is `ISO-8859-15`.
Note that the page contains a large amount of HTML after the closing `</html>` tag.
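To see whether another page has the same shape, a small helper can measure how much text follows the last closing `</html>` tag. This is only a sketch: `TrailingMarkupCheck` and `trailingLength` are hypothetical names, the check uses plain string matching rather than a real HTML parser, and it has not been confirmed that trailing markup alone is what triggers the exception.

```java
// Hypothetical helper for illustration; not part of Any23 or jsoup.
final class TrailingMarkupCheck {

    // Returns the length of the trimmed text that follows the last closing
    // </html> tag (matched case-insensitively), or -1 if there is no such tag.
    static int trailingLength(String html) {
        String closeTag = "</html>";
        // Scan backwards so that the last occurrence wins.
        for (int i = html.length() - closeTag.length(); i >= 0; i--) {
            if (html.regionMatches(true, i, closeTag, 0, closeTag.length())) {
                return html.substring(i + closeTag.length()).trim().length();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up sample input: a complete document followed by stray markup.
        String page = "<html><body>ok</body></html>\n<div>stray</div>";
        System.out.println(trailingLength(page));
    }
}
```

A result greater than zero means the document carries content after `</html>`, as the page in this report does.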
Replacing:
ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);
with:
ParseErrorList htmlErrors = ParseErrorList.tracking(100);
fixes the issue. I am not sure why; perhaps at some point the tracked errors are too far away and the reader cannot reset far enough.
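Until the detector itself is changed, a caller could guard the call and fall back to a default encoding. The sketch below is an assumption-laden illustration, not a proposed patch: `SafeEncodingGuess`, `Guesser`, and `guessOrDefault` are hypothetical names, and `Guesser` merely stands in for a method reference such as `new TikaEncodingDetector()::guessEncoding`, so that the example runs without Any23 on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical caller-side guard; not part of Any23.
final class SafeEncodingGuess {

    // Stand-in for an encoding detector (assumed shape: stream in, charset name out).
    interface Guesser {
        String guess(InputStream in) throws IOException;
    }

    // Returns the guessed encoding, or the fallback if the detector returns null
    // or fails with the ArrayIndexOutOfBoundsException described in this report.
    static String guessOrDefault(Guesser guesser, InputStream in, String fallback)
            throws IOException {
        try {
            String encoding = guesser.guess(in);
            return encoding != null ? encoding : fallback;
        } catch (ArrayIndexOutOfBoundsException e) {
            return fallback;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulates the failure from the stack trace instead of calling the real detector.
        Guesser failing = in -> {
            throw new ArrayIndexOutOfBoundsException("Index -1 out of bounds for length 32768");
        };
        InputStream in = new ByteArrayInputStream(new byte[0]);
        System.out.println(guessOrDefault(failing, in, "UTF-8"));
    }
}
```

The fallback hides the failure rather than fixing it, so for the page above it would report the fallback value instead of the expected `ISO-8859-15`.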