Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 0.9
Fix Version/s: None
Component/s: None
Description
The attached HTML file causes an IOException from TagSoup.
(Changing the CR line endings to LF fixes the problem.)
Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.html.HtmlParser@22b6d6ab
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:203)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
Caused by: java.io.IOException: Pushback buffer overflow
    at java.io.PushbackReader.unread(PushbackReader.java:138)
    at org.ccil.cowan.tagsoup.HTMLScanner.unread(HTMLScanner.java:274)
    at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:487)
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    ... 5 more
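
For anyone hitting the same trace, here is a rough sketch of the two pieces involved, using only the JDK. It does not call Tika or TagSoup. The class name CrToLfSketch and both method names are made up for illustration. The first method applies the CR-to-LF workaround mentioned in the description to the HTML text before it would be handed to a parser; folding CRLF into LF as well is an assumption, since the report only mentions CR. The second method shows, in isolation, how java.io.PushbackReader produces the "Pushback buffer overflow" message when more characters are pushed back than its buffer holds. Whether the workaround avoids the exception for the attached file rests on the reporter's note, not on this sketch.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Hypothetical helper class, not part of Tika or TagSoup.
final class CrToLfSketch {

    /** Rewrites CRLF pairs and lone CR characters to LF (assumed workaround). */
    static String normalizeLineEndings(String html) {
        // Handle CRLF first so it becomes one LF rather than two.
        return html.replace("\r\n", "\n").replace('\r', '\n');
    }

    /**
     * Pushes back two characters into a one-character pushback buffer and
     * returns the message of the resulting IOException.
     */
    static String pushbackOverflowMessage() {
        try (PushbackReader reader = new PushbackReader(new StringReader("ab"), 1)) {
            int first = reader.read();
            int second = reader.read();
            reader.unread(second); // fills the one-character buffer
            reader.unread(first);  // no room left, so this throws IOException
            return null;           // not reached
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String fixed = normalizeLineEndings("<p>a</p>\r<p>b</p>\r");
        System.out.println(fixed.indexOf('\r')); // -1 once every CR is replaced
        System.out.println(pushbackOverflowMessage());
    }
}
```

The normalization works on a String for brevity; for large inputs a filtering Reader that maps CR to LF on the fly would avoid holding the whole document in memory.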