Tika / TIKA-676

Boilerpipe fails


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Component/s: parser

    Description

      This is apparently a boilerpipe issue, which they have fixed in the Web API edition.

      $ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698
      100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
      Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
      	at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
      	at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
      	at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
      	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
      	at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
      	at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
      	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
      	at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
      	at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
      	at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
      	at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
      	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
      	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
      	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
      	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
      	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
      	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
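
      The exception message says the SAX event stream contained an A element opened inside another A element. As a rough illustration of that condition only, the sketch below counts open anchors with a plain JDK SAX handler. It is not boilerpipe's or Tika's code: the names NestedAnchorCheck, AnchorDepthHandler and hasNestedAnchors are hypothetical, and it assumes well-formed XHTML input, so real tag-soup HTML such as the page above would first need a lenient parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or boilerpipe.
final class NestedAnchorCheck {

    /** Tracks how many <a> elements are currently open. */
    static final class AnchorDepthHandler extends DefaultHandler {
        private int depth = 0;
        boolean nested = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName)) {
                depth++;
                // A second <a> opened before the first closed: nested anchors.
                if (depth > 1) {
                    nested = true;
                }
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("a".equalsIgnoreCase(qName)) {
                depth--;
            }
        }
    }

    /** Returns true if the well-formed XHTML fragment nests one <a> in another. */
    static boolean hasNestedAnchors(String xhtml) {
        AnchorDepthHandler handler = new AnchorDepthHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Malformed input or parser setup failure.
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.nested;
    }
}
```

      Calling NestedAnchorCheck.hasNestedAnchors on a fragment before handing it to boilerpipe would be one way to spot the offending markup, in line with the message's advice to clean the HTML externally.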
      

      People

        Assignee: Unassigned
        Reporter: Gabriele Kahlout (simpatico)
        Votes: 0
        Watchers: 9
