Nutch / NUTCH-1596

HeadingsParseFilter not thread safe


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7
    • Fix Version/s: 1.8
    • Component/s: None
    • Labels: None

    Description

      The NodeWalker used by the HeadingsParseFilter occasionally throws a NullPointerException.

      2013-07-02 11:02:09,428 WARN  parse.ParseUtil - Error parsing .... with org.apache.nutch.parse.tika.TikaParser@2c8b586a
      java.util.concurrent.ExecutionException: java.lang.NullPointerException
              at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:262)
              at java.util.concurrent.FutureTask.get(FutureTask.java:119)
              at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:162)
              at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
              at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:963)
              at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:722)
      Caused by: java.lang.NullPointerException
              at org.apache.xerces.dom.ParentNode.nodeListItem(Unknown Source)
              at org.apache.xerces.dom.ParentNode.item(Unknown Source)
              at org.apache.nutch.util.NodeWalker.nextNode(NodeWalker.java:75)
              at org.apache.nutch.parse.headings.HeadingsParseFilter.getElement(HeadingsParseFilter.java:84)
              at org.apache.nutch.parse.headings.HeadingsParseFilter.filter(HeadingsParseFilter.java:47)
              at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:98)
              at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:210)
              at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
              at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
              at java.util.concurrent.FutureTask.run(FutureTask.java:166)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:722)
      

      This is strange: it fails only rarely, the nextNode() method checks hasNext(), and, if I'm correct, there is no concurrent access.
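
      For illustration only, here is a minimal sketch of why a hasNext() check alone does not help once walker state is reachable from more than one parser thread, and of the thread-confined alternative. Everything in it (WalkerSketch, Node, Walker, countElements, countConcurrently, sampleTree) is hypothetical and self-contained. It is not the Nutch NodeWalker, not the HeadingsParseFilter, and not the attached patch. It only assumes that the filter may be invoked from several parser threads at once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WalkerSketch {
    // Hypothetical minimal tree node; it stands in for a DOM node.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        Node add(Node child) { children.add(child); return this; }
    }

    // Hypothetical stack-based pre-order walker. It is NOT thread safe:
    // hasNext() and nextNode() read and mutate the same stack, so a second
    // thread can empty or refill it between the check and the call.
    static final class Walker {
        private final Deque<Node> stack = new ArrayDeque<>();
        Walker(Node root) { stack.push(root); }
        boolean hasNext() { return !stack.isEmpty(); }
        Node nextNode() {
            Node current = stack.pop();
            // Push children in reverse so they are visited in document order.
            for (int i = current.children.size() - 1; i >= 0; i--) {
                stack.push(current.children.get(i));
            }
            return current;
        }
    }

    // Thread-confined use: the walker is a local variable, so every call
    // (and therefore every thread) gets its own traversal state.
    static int countElements(Node root, String name) {
        Walker walker = new Walker(root);
        int count = 0;
        while (walker.hasNext()) {
            if (walker.nextNode().name.equals(name)) {
                count++;
            }
        }
        return count;
    }

    // Runs the same count from several threads over one read-only tree.
    static List<Integer> countConcurrently(Node root, String name, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Callable<Integer> task = () -> countElements(root, name);
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Small fixed tree: html -> body -> (h1, div -> (h1, p)).
    static Node sampleTree() {
        Node div = new Node("div").add(new Node("h1")).add(new Node("p"));
        Node body = new Node("body").add(new Node("h1")).add(div);
        return new Node("html").add(body);
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(sampleTree(), "h1", 8));
    }
}
```

      If the Walker were instead stored in an instance field of a filter object shared by all parser threads, the check in hasNext() and the pop in nextNode() could interleave across threads, which would fit a failure that shows up only rarely. Whether that is the actual cause here is an assumption, not something the stack trace proves.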

      Attachments

        1. NUTCH-1596-v1.patch
          1 kB
          Sebastian Nagel

          People

            Assignee: Markus Jelsma (markus17)
            Reporter: Markus Jelsma (markus17)