Tika / TIKA-881

HtmlParser sometimes(!) throws IOException while determining Html-Encoding

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.0
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
    • Environment: Windows 7, JDK 1.5, JDK 1.6

      Description

      Attention, this is an ugly one: a non-deterministic bug. It fails only about 1 time in 10.

      java.io.IOException: Resetting to invalid mark
      at java.io.BufferedInputStream.reset(Unknown Source)
      at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
      at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
      at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)

      In the getEncoding() method, the input stream has to be read twice. To make that possible, the current read position is marked with mark(), passing a readlimit (the maximum number of bytes that may be read before the mark position becomes invalid).

      So far so good, but then an InputStreamReader comes into play. Its API documentation says:

      "... To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation. ..."

      Please note the word "may". When the reader does read ahead, the number of bytes consumed from the stream can exceed the readlimit, which invalidates the mark position. The subsequent reset() on the stream then throws the exception shown above.
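One way to avoid the read-ahead is to keep the InputStreamReader away from the marked stream: read at most readlimit bytes directly from the stream, reset it, and decode the copied bytes afterwards. The following is only a minimal sketch of that idea, not the attached fix and not Tika's actual code. The class name PeekPrefix, the helper peekAscii, and the sample input are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PeekPrefix {

    /**
     * Hypothetical helper: returns up to 'limit' leading bytes of the stream,
     * decoded as US-ASCII, and rewinds the stream to where it started.
     * Because no more than 'limit' bytes are ever read, the mark stays valid.
     */
    static String peekAscii(InputStream stream, int limit) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        // Loop because a single read() may return fewer bytes than requested.
        while (total < limit) {
            int n = stream.read(buf, total, limit - total);
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            total += n;
        }
        stream.reset();
        // Decode from the copy; no reader ever touches the marked stream.
        return new String(buf, 0, total, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><meta charset=\"ISO-8859-1\"></head><body>x</body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(html));
        System.out.println(peekAscii(in, 32));   // leading bytes only
        System.out.println((char) in.read());    // stream is back at the start
    }
}
```

The trade-off is one extra byte array of readlimit size, in exchange for a reset() that no longer depends on how much the decoder chooses to buffer.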

        Attachments

        1. HtmlParser.java
          11 kB
          Klaus v. Einem
        2. BugfixHtmlParser.java
          10 kB
          Klaus v. Einem

            People

            • Assignee: Unassigned
            • Reporter: Klaus v. Einem (v.einem)
            • Votes: 0
            • Watchers: 2

                Time Tracking

                • Estimated: 0.5h
                • Remaining: 0.5h
                • Logged: Not Specified