  Tika / TIKA-608

IOException from tagsoup


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      The attached HTML file causes an IOException from TagSoup.
      (Changing the line endings from CR to LF fixes the problem.)
      Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.html.HtmlParser@22b6d6ab
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
      at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
      at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
      at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
      Caused by: java.io.IOException: Pushback buffer overflow
      at java.io.PushbackReader.unread(PushbackReader.java:138)
      at org.ccil.cowan.tagsoup.HTMLScanner.unread(HTMLScanner.java:274)
      at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:487)
      at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
      at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      ... 5 more
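
      The reporter notes that converting CR line endings to LF avoids the
      exception. A minimal sketch of that workaround follows, assuming the
      input uses an ASCII-compatible encoding (UTF-8, ISO-8859-1, and so on).
      The class name CrToLfInputStream and the normalize helper are
      hypothetical and are not part of Tika or TagSoup; this is not the fix
      applied upstream, only an illustration of pre-filtering the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper (not part of Tika or TagSoup): rewrites CR and CRLF
 * line endings to LF at the byte level. Assumes an ASCII-compatible
 * encoding; it would corrupt UTF-16 input.
 */
class CrToLfInputStream extends FilterInputStream {
    // True when the previous byte read from the source was a CR.
    private boolean lastWasCr = false;

    CrToLfInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        // Drop the LF of a CRLF pair; the CR was already emitted as LF.
        if (b == '\n' && lastWasCr) {
            b = in.read();
        }
        lastWasCr = (b == '\r');
        return lastWasCr ? '\n' : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int count = 0;
        while (count < len) {
            int b = read();
            if (b == -1) {
                break;
            }
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }

    @Override
    public boolean markSupported() {
        // mark/reset on the source would bypass the CR tracking state.
        return false;
    }

    /** Convenience wrapper for trying the filter on an in-memory string. */
    static String normalize(String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new CrToLfInputStream(new ByteArrayInputStream(bytes))) {
            byte[] chunk = new byte[4];
            int n;
            while ((n = in.read(chunk, 0, chunk.length)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String fixed = normalize("<p>one</p>\r<p>two</p>\r\n<p>three</p>");
        System.out.println(fixed.contains("\r") ? "CR remains" : "no CR left");
    }
}
```

      In practice the wrapped stream would be handed to whichever parser
      entry point accepts an InputStream in place of the raw file stream.
      Whether this avoids the pushback overflow for a given file depends on
      the reporter's observation holding for that input.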

      Attachments

        1. test.html
          4 kB
          Erik Hetzner

          People

            Assignee: Kenneth William Krugler (kkrugler)
            Reporter: Erik Hetzner (egh)
            Votes: 0
            Watchers: 0
