Nutch / NUTCH-2071

A parser failure on a single document may fail the whole crawling job if parser.timeout=-1

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.14, 1.15
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available
    • Flags: Patch

      Description

      java.io.IOException: Job failed!
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
      at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
      <...>
      Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
      at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
      at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
      at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
      at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
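
      A note on the trace above: java.lang.IncompatibleClassChangeError is a subclass of java.lang.LinkageError, which is an Error, not an Exception, so a catch (Exception e) handler does not intercept it. The following minimal sketch illustrates this with a simulated error. The class and method names are hypothetical and are not part of Nutch or Tika.

```java
class ErrorVersusException {

    /** Reports which handler intercepts a simulated linkage failure. */
    static String handledBy() {
        try {
            try {
                // Simulated; in the report the JVM raises this while defining a Tika class.
                throw new IncompatibleClassChangeError("simulated linkage failure");
            } catch (Exception e) {
                // Not reached: the thrown object is an Error, not an Exception.
                return "inner catch (Exception)";
            }
        } catch (Throwable t) {
            // Reached: Throwable is the common supertype of Error and Exception.
            return "outer catch (Throwable)";
        }
    }

    public static void main(String[] args) {
        System.out.println(handledBy());
        System.out.println(Error.class.isAssignableFrom(IncompatibleClassChangeError.class));
        System.out.println(Exception.class.isAssignableFrom(IncompatibleClassChangeError.class));
    }
}
```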

      Suggested fix in ParseUtil:

      Replace

      if (maxParseTime != -1)
        parseResult = runParser(parsers[i], content);
      else
        parseResult = parsers[i].getParse(content);

      with

      try {
        if (maxParseTime != -1)
          parseResult = runParser(parsers[i], content);
        else
          parseResult = parsers[i].getParse(content);
      } catch (Throwable e) {
        LOG.warn("Parsing " + content.getUrl() + " with "
            + parsers[i].getClass().getName() + " failed: " + e.getMessage());
        parseResult = null;
      }
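
      The effect of the suggested change can be sketched outside Nutch as follows. This is an illustrative stand-alone example, not the actual ParseUtil code: the DocParser interface, the parseWithFallback method, and the warnings list (standing in for LOG.warn) are assumptions made for the sketch. It shows one failing parser being skipped while the loop continues with the next one.

```java
import java.util.ArrayList;
import java.util.List;

class ParseIsolationSketch {

    /** Hypothetical stand-in for a parser plugin; not the real Nutch Parser API. */
    interface DocParser {
        String parse(String content);
    }

    /**
     * Tries each parser in turn. Any Throwable raised by a parser is recorded
     * as a warning and treated as "no result", so the caller keeps running.
     */
    static String parseWithFallback(List<DocParser> parsers, String url,
                                    String content, List<String> warnings) {
        for (DocParser parser : parsers) {
            String result;
            try {
                result = parser.parse(content);
            } catch (Throwable e) { // Throwable, so Error subclasses are caught too
                warnings.add("Parsing " + url + " with "
                        + parser.getClass().getName() + " failed: " + e.getMessage());
                result = null;
            }
            if (result != null) {
                return result;
            }
        }
        return null; // no parser produced a result
    }

    public static void main(String[] args) {
        List<DocParser> parsers = new ArrayList<>();
        parsers.add(c -> { throw new IncompatibleClassChangeError("simulated"); });
        parsers.add(c -> "parsed:" + c);

        List<String> warnings = new ArrayList<>();
        String result = parseWithFallback(parsers, "http://example.org/doc", "body", warnings);
        System.out.println(result);
        System.out.println(warnings.size() + " warning(s) recorded");
    }
}
```

      Catching Throwable is a broad net: it also intercepts conditions such as OutOfMemoryError, which a production version might prefer to rethrow rather than log and continue.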

        Attachments

        1. NUTCH-2071.diff
          2 kB
          Arkadi Kosmynin

              People

              • Assignee: Sebastian Nagel (wastl-nagel)
              • Reporter: Arkadi Kosmynin (ArkadiKosmynin)