[NUTCH-1253] Incompatible neko and xerces versions

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 2.3, 1.8
    • Component/s: None
    • Labels: None
    • Environment: Ubuntu 10.04
    • Patch Info: Patch Available

    Description

      The Nutch 1.4 distribution includes

      • nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
      • xercesImpl-2.9.1.jar (under .../runtime/local/lib)
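
      To confirm which JAR a given class is actually loaded from, a small probe along the following lines can help. This is a minimal sketch, not Nutch code: the class name JarLocator and the two probed class names are assumptions chosen for illustration, and the probe only sees the class loader it runs in.

```java
import java.security.CodeSource;

class JarLocator {

    /** Returns the location a class was loaded from, or a short reason why it is unknown. */
    static String locate(String className) {
        try {
            // initialize=false: resolve the class without running its static initializers
            Class<?> c = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK classes loaded by the bootstrap loader have no code source
            return src == null ? "bootstrap or platform class loader"
                               : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        } catch (LinkageError e) {
            return "failed to link: " + e;
        }
    }

    public static void main(String[] args) {
        // Assumed probe targets: one class from nekohtml, one from xercesImpl
        String[] names = {
            "org.cyberneko.html.parsers.DOMFragmentParser",
            "org.apache.xerces.parsers.AbstractSAXParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

      Because Nutch loads plugin libraries through separate plugin class loaders, a probe like this would have to run inside the plugin itself (for example, temporarily from the parser code) to report the nekohtml JAR the plugin really sees.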

      These two JARs appear to be incompatible. When the HtmlParser (configured
      to use neko) is invoked during a local-mode crawl, the parse fails with an
      AbstractMethodError. (Note: to see the AbstractMethodError, add a
      catch(Throwable) clause to the getParse method that logs the stack trace,
      then rebuild the HtmlParser plugin.)
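
      The reason a catch(Throwable) clause is needed is that AbstractMethodError is an Error, not an Exception, so a catch(Exception) block never sees it. The sketch below illustrates this with a simulated failure. ParseStep and runAndCapture are hypothetical names for illustration only and are not part of the Nutch HtmlParser.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class ThrowableLoggingSketch {

    /** Hypothetical stand-in for the neko-backed parse step; not Nutch code. */
    interface ParseStep {
        String run() throws Exception;
    }

    /** Runs the step and returns its result, or the full stack trace as text on failure. */
    static String runAndCapture(ParseStep step) {
        try {
            return step.run();
        } catch (Throwable t) {
            // catch (Exception e) would miss AbstractMethodError, which is a LinkageError
            StringWriter sw = new StringWriter();
            t.printStackTrace(new PrintWriter(sw, true));
            return sw.toString();
        }
    }

    public static void main(String[] args) {
        // Simulate the failure mode described in the report
        String out = runAndCapture(() -> {
            throw new AbstractMethodError("simulated neko/xerces mismatch");
        });
        System.out.println(out);
    }
}
```

      In the real plugin, the captured trace would go to the plugin's logger rather than being returned as a string.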

      I found that substituting a later, compatible version of nekohtml (1.9.11)
      fixes the problem.

      Curiously, and in support of the above, the nekohtml plugin.xml file in
      Nutch 1.4 contains the following:

      <plugin
         id="lib-nekohtml"
         name="CyberNeko HTML Parser"
         version="1.9.11"
         provider-name="org.cyberneko">

         <runtime>
            <library name="nekohtml-0.9.5.jar">
               <export name="*"/>
            </library>
         </runtime>
      </plugin>

      Note the conflicting version numbers (version tag is "1.9.11" but the
      specified library is "nekohtml-0.9.5.jar").

      Was the 0.9.5 version included by mistake? Was the intention rather to
      include 1.9.11?
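
      A mismatch like the one in this plugin.xml can be detected mechanically. The following is a minimal sketch using only the JDK's XML parser. PluginVersionCheck and findMismatch are hypothetical names, and the assumption that the JAR file name ends in "-<version>.jar" is a convention, not something Nutch enforces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class PluginVersionCheck {

    // Matches a trailing version in names such as "nekohtml-0.9.5.jar"
    private static final Pattern JAR_VERSION =
            Pattern.compile("-(\\d+(?:\\.\\d+)*)\\.jar$");

    /** Returns null when the versions agree, otherwise a description of the mismatch. */
    static String findMismatch(String pluginXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                    new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse plugin descriptor", e);
        }
        Element plugin = doc.getDocumentElement();
        String declared = plugin.getAttribute("version");
        // Only the first <library> element is checked in this sketch
        Element library = (Element) plugin.getElementsByTagName("library").item(0);
        if (library == null) {
            return null;
        }
        String jarName = library.getAttribute("name");
        Matcher m = JAR_VERSION.matcher(jarName);
        if (m.find() && !m.group(1).equals(declared)) {
            return "plugin declares " + declared + " but bundles " + jarName;
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<plugin id=\"lib-nekohtml\" name=\"CyberNeko HTML Parser\""
                + " version=\"1.9.11\" provider-name=\"org.cyberneko\">"
                + "<runtime><library name=\"nekohtml-0.9.5.jar\">"
                + "<export name=\"*\"/></library></runtime></plugin>";
        System.out.println(findMismatch(xml));
    }
}
```

      Run against the descriptor quoted above, the check reports that the declared version 1.9.11 does not match the bundled nekohtml-0.9.5.jar.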

      Attachments

        1. NUTCH-1253.patch
          0.9 kB
          Lewis John McGibbney
        2. NUTCH-1253-nutchgora.patch
          0.9 kB
          Lewis John McGibbney
        3. NUTCH-1253-2.x-v2.patch
          3 kB
          Lewis John McGibbney
        4. TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt
          1 kB
          Lewis John McGibbney
        5. TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt
          1 kB
          Lewis John McGibbney
        6. nutch1253parsed.html
          0.5 kB
          Sebastian Nagel
        7. nutch1253test.html
          0.4 kB
          Sebastian Nagel
        8. NUTCH-1253-trunk.patch
          48 kB
          Lewis John McGibbney
        9. NUTCH-1253-trunk.v2.patch
          5 kB
          Sebastian Nagel
        10. NUTCH-1253-2.x-eclipse.patch
          0.6 kB
          Talat Uyarer

          People

            Assignee: Lewis John McGibbney (lewismc)
            Reporter: Dennis Spathis (dspathis)
            Votes: 0
            Watchers: 7
