  Nutch / NUTCH-1253

Incompatible neko and xerces versions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 2.3, 1.8
    • Component/s: None
    • Labels:
      None
    • Environment:

      Ubuntu 10.04

    • Patch Info:
      Patch Available

      Description

      The Nutch 1.4 distribution includes

      • nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
      • xercesImpl-2.9.1.jar (under .../runtime/local/lib)

      These two JARs appear to be incompatible with each other. When the
      HtmlParser (configured to use neko) is invoked during a local-mode crawl,
      the parse fails with an AbstractMethodError. (Note: to see the
      AbstractMethodError, rebuild the HtmlParser plugin and add a
      catch(Throwable) clause to the getParse method that logs the stack trace.)
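
      A minimal sketch of the catch(Throwable) idea from the note above. This
      is not the actual Nutch HtmlParser code: the class name, the
      parseLoggingAllFailures helper, and the simulated error are all
      hypothetical stand-ins. The point it illustrates is that
      AbstractMethodError is a java.lang.Error, so a catch(Exception) clause
      would not see it, while catch(Throwable) does.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ThrowableLoggingSketch {
    private static final Logger LOG =
        Logger.getLogger(ThrowableLoggingSketch.class.getName());

    /**
     * Runs a parse step and logs anything it throws, including Errors.
     * Returning null on failure is a convention of this sketch only.
     */
    static <T> T parseLoggingAllFailures(Supplier<T> parseStep) {
        try {
            return parseStep.get();
        } catch (Throwable t) {
            // catch (Exception e) would miss AbstractMethodError,
            // because it extends java.lang.Error, not Exception.
            LOG.log(Level.SEVERE, "Parse failed", t);
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the neko-backed parse call (assumption): the error is
        // thrown directly here instead of by mismatched JARs.
        String result = parseLoggingAllFailures(() -> {
            throw new AbstractMethodError("simulated binary incompatibility");
        });
        System.out.println("result = " + result);
    }
}
```

      With a clause like this in place, the logged stack trace should show
      which method call hit the missing implementation.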

      I found that substituting a later, compatible version of nekohtml (1.9.11)
      fixes the problem.

      Curiously, and in support of the above, the nekohtml plugin.xml file in
      Nutch 1.4 contains the following:

      <plugin
         id="lib-nekohtml"
         name="CyberNeko HTML Parser"
         version="1.9.11"
         provider-name="org.cyberneko">

         <runtime>
            <library name="nekohtml-0.9.5.jar">
               <export name="*"/>
            </library>
         </runtime>
      </plugin>

      Note the conflicting version numbers (version tag is "1.9.11" but the
      specified library is "nekohtml-0.9.5.jar").
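
      The mismatch described above could be detected mechanically. The
      following is a rough sketch, not part of Nutch: the class and method
      names are hypothetical, and it assumes (as a heuristic only) that a
      library plugin's version attribute is meant to appear in the JAR file
      name. Plugins whose version follows a different scheme would be flagged
      falsely.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PluginVersionCheckSketch {

    /**
     * Returns true if every <library name="..."> in the descriptor contains
     * the plugin's version string (heuristic assumed for this sketch).
     */
    static boolean librariesMatchVersion(String pluginXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                pluginXml.getBytes(StandardCharsets.UTF_8)));
        Element plugin = doc.getDocumentElement();
        String version = plugin.getAttribute("version");
        NodeList libraries = plugin.getElementsByTagName("library");
        for (int i = 0; i < libraries.getLength(); i++) {
            String jar = ((Element) libraries.item(i)).getAttribute("name");
            if (!jar.contains(version)) {
                return false; // JAR name does not mention the declared version
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // The descriptor quoted in this report, collapsed onto one string.
        String xml =
            "<plugin id=\"lib-nekohtml\" name=\"CyberNeko HTML Parser\""
            + " version=\"1.9.11\" provider-name=\"org.cyberneko\">"
            + "<runtime><library name=\"nekohtml-0.9.5.jar\">"
            + "<export name=\"*\"/></library></runtime></plugin>";
        System.out.println(librariesMatchVersion(xml)); // prints false
    }
}
```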

      Was the 0.9.5 version included by mistake? Was the intention rather to
      include 1.9.11?

        Attachments

        1. NUTCH-1253-2.x-eclipse.patch
          0.6 kB
          Talat UYARER
        2. NUTCH-1253-trunk.v2.patch
          5 kB
          Sebastian Nagel
        3. NUTCH-1253-trunk.patch
          48 kB
          Lewis John McGibbney
        4. nutch1253test.html
          0.4 kB
          Sebastian Nagel
        5. nutch1253parsed.html
          0.5 kB
          Sebastian Nagel
        6. TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt
          1 kB
          Lewis John McGibbney
        7. TEST-org.apache.nutch.parse.html.TestDOMContentUtils.txt
          1 kB
          Lewis John McGibbney
        8. NUTCH-1253-2.x-v2.patch
          3 kB
          Lewis John McGibbney
        9. NUTCH-1253-nutchgora.patch
          0.9 kB
          Lewis John McGibbney
        10. NUTCH-1253.patch
          0.9 kB
          Lewis John McGibbney

            People

            • Assignee:
              Lewis John McGibbney (lewismc)
            • Reporter:
              Dennis Spathis (dspathis)
            • Votes:
              0
            • Watchers:
              7
