NUTCH-2071

A parser failure on a single document may fail the whole crawling job if parser.timeout=-1

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.14, 1.15
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available
    • Flags: Patch

    Description

      java.io.IOException: Job failed!
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
      at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
      <...>
      Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
      at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
      at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
      at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
      at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)

      Suggested fix in ParseUtil:

      Replace

      if (maxParseTime != -1)
        parseResult = runParser(parsers[i], content);
      else
        parseResult = parsers[i].getParse(content);

      with

      try {
        if (maxParseTime != -1)
          parseResult = runParser(parsers[i], content);
        else
          parseResult = parsers[i].getParse(content);
      } catch (Throwable e) {
        LOG.warn("Parsing " + content.getUrl() + " with "
            + parsers[i].getClass().getName() + " failed: " + e.getMessage());
        parseResult = null;
      }
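
      The idea behind the suggested fix can be tried in isolation with a minimal sketch. It does not use the Nutch API: DocParser, SafeParseDemo and parseWithFallback are hypothetical names made up for illustration, and System.err stands in for the logger. The point it shows is that catching Throwable (rather than Exception) also covers Error subclasses such as the IncompatibleClassChangeError in the trace above, so one bad document yields a null result instead of aborting the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parser plugin; this is NOT the real Nutch Parser interface.
interface DocParser {
    String parse(String content);
}

final class SafeParseDemo {

    // Tries each parser in turn. Any failure, including java.lang.Error
    // subclasses such as IncompatibleClassChangeError, is logged and treated
    // as "no result" so that the next parser gets a chance to run.
    static String parseWithFallback(List<DocParser> parsers, String url, String content) {
        for (DocParser parser : parsers) {
            String result;
            try {
                result = parser.parse(content);
            } catch (Throwable e) {
                System.err.println("Parsing " + url + " with "
                        + parser.getClass().getName() + " failed: " + e.getMessage());
                result = null;
            }
            if (result != null) {
                return result;
            }
        }
        // Every parser failed or returned nothing.
        return null;
    }

    public static void main(String[] args) {
        List<DocParser> parsers = new ArrayList<>();
        // First parser simulates the linkage failure from the stack trace.
        parsers.add(c -> { throw new IncompatibleClassChangeError("simulated linkage failure"); });
        // Second parser succeeds.
        parsers.add(c -> c.toUpperCase());
        System.out.println(parseWithFallback(parsers, "http://example.com/doc", "ok"));
    }
}
```

      Catching Throwable is deliberately broad. It also swallows conditions such as OutOfMemoryError, so a real job might prefer to rethrow some Error types; that trade-off is not discussed in this issue.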

      Attachments

        1. NUTCH-2071.diff
          2 kB
          Arkadi Kosmynin

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Arkadi Kosmynin (ArkadiKosmynin)