Tika / TIKA-881

HtmlParser sometimes(!) throws IOException while determining Html-Encoding

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.0
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
    • Environment:

      Windows7, JDK1.5, JDK1.6

      Description

      Attention, this is an ugly one: a non-deterministic bug. It fails only about 1 time in 10.

      java.io.IOException: Resetting to invalid mark
      at java.io.BufferedInputStream.reset(Unknown Source)
      at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
      at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
      at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)

      In the getEncoding() method: so that the input stream can be re-read, the current read position is marked and a readlimit is passed (the maximum number of bytes that may be read before the mark position becomes invalid).

      So far so good, but then an InputStreamReader comes into play. When you check the API-Doc you see this:

      ...
      To enable the efficient conversion of bytes to characters, more bytes may
      be read ahead from the underlying stream than are necessary to satisfy the
      current read operation.
      ...

      Please note the word "may". When this read-ahead happens, the subsequent reset() on the stream throws the exception, because the mark position has been invalidated (the number of bytes read exceeds the readlimit).
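
The mechanism described above can be reproduced with a small sketch. It is not taken from HtmlParser: the class name MarkInvalidationDemo, the method name, and the deliberately tiny 16-byte buffer and 16-byte readlimit are assumptions, chosen so that the decoder's read-ahead overruns the readlimit on every run instead of only occasionally. It relies on how OpenJDK's BufferedInputStream and InputStreamReader behave; other implementations may differ.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Tika.
final class MarkInvalidationDemo {

    /**
     * Marks a stream, decodes a single character through an
     * InputStreamReader, then tries to reset the stream.
     * Returns the message of the IOException thrown by reset(),
     * or null if reset() succeeded.
     */
    public static String resetAfterDecodingOneChar() {
        byte[] data = new byte[1024];
        Arrays.fill(data, (byte) 'a');

        // Assumption: a 16-byte buffer and a 16-byte readlimit, far smaller
        // than the amount of data the reader is likely to pull in one go.
        InputStream stream =
                new BufferedInputStream(new ByteArrayInputStream(data), 16);
        stream.mark(16);

        try {
            Reader reader = new InputStreamReader(stream, StandardCharsets.US_ASCII);
            // Only one character is requested, but the reader may read
            // ahead many more bytes from the underlying stream.
            reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        try {
            stream.reset();
            return null; // mark was still valid
        } catch (IOException e) {
            return e.getMessage(); // mark was invalidated by the read-ahead
        }
    }

    public static void main(String[] args) {
        System.out.println("reset() failure: " + resetAfterDecodingOneChar());
    }
}
```

In real use the buffer and the readlimit are much larger, so whether the read-ahead crosses the limit depends on how the bytes happen to arrive, which would fit the intermittent failures reported here.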

      1. BugfixHtmlParser.java
        10 kB
        Klaus v. Einem
      2. HtmlParser.java
        11 kB
        Klaus v. Einem

        Activity

        Jukka Zitting made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Duplicate [ 3 ]
        Jukka Zitting added a comment -

        This has since been fixed by the AutoDetectReader class, which HtmlParser now uses to detect the content encoding.

        Ken Krugler made changes -
        Assignee Ken Krugler [ kkrugler ]
        Ken Krugler added a comment -

        Looks like an InputStream issue, not something with HtmlParser. Input streams should get "wrapped" by Tika such that a reset() will always work.

        Ken Krugler added a comment -

        I've asked Jukka to look into this. From my email to tika-dev:

        The fix that Klaus provided avoids using reset() on the input stream.

        But I thought that Tika tries to wrap streams such that a reset() will work properly, as otherwise auto detection of content can fail.

        I haven't had to dig into all of the tricky issues around stream management, so I'm hoping you can take a look at Klaus's report and provide commentary.

        Ken Krugler added a comment -

        Hi Klaus - thanks for debugging this. I'll take a look at your patch over the next few days.

        Ken Krugler made changes -
        Assignee Ken Krugler [ kkrugler ]
        Klaus v. Einem made changes -
        Attachment HtmlParser.java [ 12519436 ]
        Klaus v. Einem added a comment - edited

        HtmlParser.java: This is the 100% original source code with the bugfix included.

        Klaus v. Einem made changes -
        Field Original Value New Value
        Attachment BugfixHtmlParser.java [ 12519430 ]
        Klaus v. Einem added a comment - edited

        BugfixHtmlParser.java: This is my workaround. Sorry, the comments are in German. The key is: no InputStreamReader, no cry! It reads a byte array and decodes it afterwards with the String constructor.

        To get this up and running you have to copy the two source files HtmlHandler.java and XHTMLDowngradeHandler.java from the Tika sources (package: org.apache.tika.parser.html) into the package where BugfixHtmlParser.java lives. Why? Because they are package-private.
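
The idea behind the workaround can be sketched roughly as follows. This is not the attached BugfixHtmlParser.java: the class name PeekBytes, the method peekAscii and its limit parameter are hypothetical. The sketch reads at most limit bytes directly from the stream, so the mark should stay valid under the InputStream.mark() contract, and only afterwards decodes them with the String constructor.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tika.
final class PeekBytes {

    /**
     * Returns up to `limit` leading bytes of the stream decoded as US-ASCII,
     * then resets the stream to where it was before the call.
     */
    public static String peekAscii(InputStream stream, int limit) {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            stream.mark(limit);
            byte[] buffer = new byte[limit];
            int total = 0;
            // Never request more than `limit` bytes in total, so the
            // readlimit passed to mark() is not exceeded.
            while (total < limit) {
                int n = stream.read(buffer, total, limit - total);
                if (n < 0) {
                    break; // end of stream
                }
                total += n;
            }
            stream.reset();
            // Decode afterwards; no Reader ever touches the stream.
            return new String(buffer, 0, total, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            // Wrapped only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        InputStream stream =
                new BufferedInputStream(new ByteArrayInputStream(html), 4);
        System.out.println(peekAscii(stream, 12));
        // The stream is back at its start, so a parser could now consume it.
        System.out.println((char) stream.read());
    }
}
```

Because the byte count is controlled by the caller rather than by a decoder, the amount read no longer depends on read-ahead behaviour.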

        Klaus v. Einem created issue -

          People

          • Assignee: Unassigned
          • Reporter: Klaus v. Einem
          • Votes: 0
          • Watchers: 2


              Time Tracking

              • Estimated: 0.5h
              • Remaining: 0.5h
              • Logged: Not Specified
