Description
The Nutch 1.4 distribution includes:
- nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
- xercesImpl-2.9.1.jar (under .../runtime/local/lib)
These two JAR versions appear to be incompatible with each other. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails with an AbstractMethodError.
(Note: To see the AbstractMethodError, rebuild the HtmlParser plugin with a catch(Throwable) clause added to the getParse method so that the stack trace is logged.)
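As a minimal sketch of why the catch(Throwable) clause is needed: AbstractMethodError is a subclass of java.lang.Error, not of Exception, so a catch(Exception) block does not intercept it. The class and method names below (CatchThrowableDemo, ParseStep, parseCatchingException, parseCatchingThrowable) are hypothetical stand-ins and are not taken from the Nutch source; the error is simulated rather than produced by the actual JAR mismatch.

```java
public class CatchThrowableDemo {

    /** Hypothetical stand-in for the parser call that fails at link time. */
    interface ParseStep {
        String run();
    }

    /** Catches only Exception, so an Error such as AbstractMethodError escapes. */
    static String parseCatchingException(ParseStep step) {
        try {
            return step.run();
        } catch (Exception e) {
            return "failed: " + e.getClass().getName();
        }
    }

    /** Catches Throwable, so the Error is intercepted and its stack trace can be logged. */
    static String parseCatchingThrowable(ParseStep step) {
        try {
            return step.run();
        } catch (Throwable t) {
            t.printStackTrace(); // a real plugin would use its logger here
            return "failed: " + t.getClass().getName();
        }
    }

    public static void main(String[] args) {
        // Simulate the failure; the real error comes from the incompatible JARs.
        ParseStep broken = () -> {
            throw new AbstractMethodError("simulated");
        };

        System.out.println(parseCatchingThrowable(broken));

        try {
            parseCatchingException(broken);
        } catch (AbstractMethodError e) {
            System.out.println("escaped catch(Exception): " + e.getClass().getName());
        }
    }
}
```

Once the stack trace has been captured, the temporary catch(Throwable) clause should be removed again, since swallowing Errors in production code tends to hide exactly this kind of linkage problem.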
I found that substituting a later, compatible version of nekohtml (1.9.11)
fixes the problem.
Curiously, and in support of the above, the nekohtml plugin.xml file in
Nutch 1.4 contains the following:
<plugin
   id="lib-nekohtml"
   name="CyberNeko HTML Parser"
   version="1.9.11"
   provider-name="org.cyberneko">
   <runtime>
      <library name="nekohtml-0.9.5.jar">
         <export name="*"/>
      </library>
   </runtime>
</plugin>
Note the conflicting version numbers (version tag is "1.9.11" but the
specified library is "nekohtml-0.9.5.jar").
Was the 0.9.5 version included by mistake? Was the intention rather to
include 1.9.11?
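The mismatch between the plugin's version attribute and the library file name could also be detected mechanically. The following is a rough sketch using only the JDK's built-in XML parser; the class name PluginVersionCheck, the method versionsAgree, and the assumption that the JAR file name ends in "-<version>.jar" are illustrative and are not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PluginVersionCheck {

    // Assumption: library names follow the pattern "<artifact>-<version>.jar".
    private static final Pattern JAR_VERSION =
        Pattern.compile("-(\\d+(?:\\.\\d+)*)\\.jar$");

    /** Returns true when every versioned library name matches the plugin's version attribute. */
    public static boolean versionsAgree(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
            Element plugin = doc.getDocumentElement();
            String declared = plugin.getAttribute("version");
            NodeList libraries = plugin.getElementsByTagName("library");
            for (int i = 0; i < libraries.getLength(); i++) {
                String name = ((Element) libraries.item(i)).getAttribute("name");
                Matcher m = JAR_VERSION.matcher(name);
                if (m.find() && !m.group(1).equals(declared)) {
                    return false; // e.g. declared 1.9.11 but the JAR is 0.9.5
                }
            }
            return true;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse plugin descriptor", e);
        }
    }

    public static void main(String[] args) {
        // The descriptor quoted in this report.
        String xml =
            "<plugin id=\"lib-nekohtml\" name=\"CyberNeko HTML Parser\""
            + " version=\"1.9.11\" provider-name=\"org.cyberneko\">"
            + "<runtime><library name=\"nekohtml-0.9.5.jar\">"
            + "<export name=\"*\"/></library></runtime></plugin>";
        System.out.println("versions agree: " + versionsAgree(xml)); // prints "versions agree: false"
    }
}
```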