Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
Description
This is apparently a boilerpipe issue, which they have fixed in the Web API edition.
$ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698
Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
    at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
    at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
    at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
    at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
    at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
    at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
    at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
    at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
    at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
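The exception message says the SAX event stream contained an A element opened inside another A element. The following is a minimal sketch of how that condition could be detected in a SAX stream. It is not boilerpipe's or Tika's actual code. The class and method names (NestedAnchorCheck, AnchorDepthHandler, countNestedAnchors) are hypothetical. It uses the JDK's built-in XML parser rather than TagSoup, so it assumes the input is well-formed XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only: counts <a> start tags that
// arrive while another <a> element is still open.
public class NestedAnchorCheck {

    static class AnchorDepthHandler extends DefaultHandler {
        private int depth = 0;   // number of currently open <a> elements
        int nestedCount = 0;     // <a> start tags seen inside an open <a>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName)) {
                if (depth > 0) {
                    nestedCount++;
                }
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("a".equalsIgnoreCase(qName)) {
                depth--;
            }
        }
    }

    // Parses well-formed XHTML and returns the number of nested <a> start tags.
    static int countNestedAnchors(String xhtml) throws Exception {
        AnchorDepthHandler handler = new AnchorDepthHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.nestedCount;
    }

    public static void main(String[] args) throws Exception {
        String nested = "<p><a href=\"/x\">outer <a href=\"/y\">inner</a></a></p>";
        String flat = "<p><a href=\"/x\">one</a> <a href=\"/y\">two</a></p>";
        System.out.println(countNestedAnchors(nested)); // prints 1
        System.out.println(countNestedAnchors(flat));   // prints 0
    }
}
```

A check of this kind could be run over a page before handing it to boilerpipe, to tell whether the markup is likely to trigger the error above.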
Issue Links
- is depended upon by: NUTCH-961 Expose Tika's boilerpipe support (Closed)