Apache Any23 (Retired) / ANY23-441

TikaEncodingDetector: guessEncoding may throw an ArrayIndexOutOfBoundsException

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.4
    • Component/s: encoding
    • Labels: None

    Description

      Using `TikaEncodingDetector.guessEncoding` may result in an `ArrayIndexOutOfBoundsException`.

       

      The following snippet:

      String encoding = new TikaEncodingDetector().guessEncoding(new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
      
      System.out.println(encoding);

      results in the following exception:

      Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
          at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
          at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
          at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
          at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
          at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
          at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
          at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
          at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
          at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
          at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)

      The expected result is `ISO-8859-15`.

      Note that the page has a bunch of extra HTML at the bottom, after the closing `</html>` tag.
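
      For callers on an affected version, one possible stopgap is to guard the call and fall back to a default charset when detection fails. The sketch below is only an illustration of that idea: `EncodingGuesser`, `SafeEncodingGuesser` and `guessOrDefault` are hypothetical names that do not exist in Any23, and the `"UTF-8"` fallback is an assumption. It also hides the underlying failure, so it is not a substitute for the fix.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for an encoding detector such as TikaEncodingDetector. */
interface EncodingGuesser {
    String guessEncoding(InputStream input) throws IOException;
}

/** Hypothetical helper, not part of Any23. */
final class SafeEncodingGuesser {
    private SafeEncodingGuesser() {}

    /**
     * Returns the detected encoding, or the fallback if the detector
     * returns null or fails with an I/O or unchecked exception.
     */
    static String guessOrDefault(EncodingGuesser guesser, InputStream input, String fallback) {
        try {
            String encoding = guesser.guessEncoding(input);
            return encoding != null ? encoding : fallback;
        } catch (IOException | RuntimeException e) {
            // Covers failures like the ArrayIndexOutOfBoundsException in this report.
            return fallback;
        }
    }
}

class SafeEncodingGuesserDemo {
    public static void main(String[] args) {
        InputStream html = new ByteArrayInputStream(
                "<html></html>trailing".getBytes(StandardCharsets.ISO_8859_1));
        // Simulated detector that fails the same way as in the stack trace.
        EncodingGuesser failing = in -> {
            throw new ArrayIndexOutOfBoundsException("Index -1 out of bounds for length 32768");
        };
        System.out.println(SafeEncodingGuesser.guessOrDefault(failing, html, "UTF-8")); // prints UTF-8
    }
}
```

      Assuming the real detector's `guessEncoding(InputStream)` has a compatible signature, it could be passed as a method reference, for example `new TikaEncodingDetector()::guessEncoding`.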

       

      Replacing:

      ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);

      with:

      ParseErrorList htmlErrors = ParseErrorList.tracking(100);

      fixes the issue. I am not quite sure why; maybe at some point the tracked errors are too far back and the reader cannot reset far enough.
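
      To make the meaning of the cap concrete, here is a minimal, self-contained sketch of an error list that stops recording once a maximum size is reached. It is not jsoup's `ParseErrorList` implementation: `BoundedErrorList` and its methods are hypothetical names, and the sketch assumes only that the argument to `tracking` is the maximum number of errors to keep. It does not explain why the uncapped list triggers the exception.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not jsoup's ParseErrorList. */
final class BoundedErrorList {
    private final int maxSize;
    private final List<String> errors = new ArrayList<>();

    BoundedErrorList(int maxSize) {
        if (maxSize < 0) {
            throw new IllegalArgumentException("maxSize must be non-negative");
        }
        this.maxSize = maxSize;
    }

    /** True while the list is still below its cap. */
    boolean canAddError() {
        return errors.size() < maxSize;
    }

    /** Records the error if the cap has not been reached; returns whether it was recorded. */
    boolean add(String message) {
        if (!canAddError()) {
            return false;
        }
        errors.add(message);
        return true;
    }

    int size() {
        return errors.size();
    }
}

class BoundedErrorListDemo {
    public static void main(String[] args) {
        BoundedErrorList errors = new BoundedErrorList(100);
        for (int i = 0; i < 250; i++) {
            errors.add("error at position " + i);
        }
        System.out.println(errors.size()); // prints 100
    }
}
```

      With a cap of 100, errors beyond the first 100 are simply dropped, so the amount of error state kept during a parse stays bounded regardless of how malformed the page is.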

       

       

            People

              Assignee: Hans Brende (hansbrende)
              Reporter: Anthony Pessy (panthony)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h