Description
The DOM generated by parse-tika differs from the one generated by parse-html. Ideally, we should be able to use either parser with the same XPath expressions.
This is related to NUTCH-1592, but this time the problem is not the letter case of element names: it comes from the namespace used.
The same issue has been investigated and fixed in storm-crawler, see https://github.com/DigitalPebble/storm-crawler/pull/58.
Here is what Guillaume explained there:
When parsing the content, Tika creates a properly formatted XHTML document: all elements are created within the XHTML namespace.
However, XPath 1.0 has no concept of a default namespace, so XPath expressions such as //BODY don't match anything. To make this work, we would have to use //ns1:BODY and define a NamespaceContext which associates ns1 with "http://www.w3.org/1999/xhtml".
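To illustrate the behaviour described above, here is a minimal sketch using only the JDK's javax.xml APIs. It is not code from Nutch or storm-crawler: the class name XhtmlNamespaceXPath and its helper methods are hypothetical, and the DOM is built by hand with upper-case element names in the XHTML namespace as a stand-in for what parse-tika produces.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of Nutch or storm-crawler.
class XhtmlNamespaceXPath {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Builds <HTML><BODY/></HTML> with both elements in the XHTML namespace
    // (assumed stand-in for the DOM produced by parse-tika).
    static Document buildDoc() {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().newDocument();
            Element html = doc.createElementNS(XHTML, "HTML");
            html.appendChild(doc.createElementNS(XHTML, "BODY"));
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the nodes selected by expr; optionally binds the prefix ns1 to XHTML.
    static int count(Document doc, String expr, boolean bindNs1) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (bindNs1) {
                xpath.setNamespaceContext(new NamespaceContext() {
                    @Override
                    public String getNamespaceURI(String prefix) {
                        return "ns1".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
                    }
                    @Override
                    public String getPrefix(String uri) {
                        return XHTML.equals(uri) ? "ns1" : null;
                    }
                    @Override
                    public Iterator<String> getPrefixes(String uri) {
                        String p = getPrefix(uri);
                        return p == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singletonList(p).iterator();
                    }
                });
            }
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildDoc();
        System.out.println("//BODY     -> " + count(doc, "//BODY", false));
        System.out.println("//ns1:BODY -> " + count(doc, "//ns1:BODY", true));
    }
}
```

Under these assumptions, the unprefixed expression is expected to select nothing, while the prefixed one selects the BODY element. This is the extra ceremony that every XPath user would otherwise have to carry.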
To keep the XPath expressions simpler, I modified the DOMBuilder, which is the SAX handler we use to convert the SAX events into a DOM tree, so that it ignores a "default namespace"; the ParserBolt initializes it with the XHTML namespace. This way, //BODY matches.
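The idea of ignoring a default namespace while building the DOM can be sketched as follows. This is not the actual DOMBuilder from Nutch or storm-crawler: NamespaceStrippingHandler, its constructor argument and the countMatches helper are hypothetical names chosen for illustration, and the handler only covers elements, attributes and text.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX-to-DOM handler: elements in the configured default
// namespace are created without any namespace, so //BODY can match them.
class NamespaceStrippingHandler extends DefaultHandler {
    private final Document doc;
    private final String defaultNamespaceURI; // may be null: nothing is stripped
    private final Deque<Node> stack = new ArrayDeque<>();

    NamespaceStrippingHandler(Document doc, String defaultNamespaceURI) {
        this.doc = doc;
        this.defaultNamespaceURI = defaultNamespaceURI;
        stack.push(doc);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Element el;
        if (uri == null || uri.isEmpty() || uri.equals(defaultNamespaceURI)) {
            // Drop the namespace: the element becomes reachable with plain names.
            el = doc.createElementNS(null, localName);
        } else {
            el = doc.createElementNS(uri, qName);
        }
        for (int i = 0; i < atts.getLength(); i++) {
            el.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        stack.peek().appendChild(el);
        stack.push(el);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        stack.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
    }

    // Parses xml through the handler and counts the nodes selected by expr.
    static int countMatches(String xml, String defaultNs, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            spf.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new NamespaceStrippingHandler(doc, defaultNs));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        String xml = "<HTML xmlns=\"" + xhtml + "\"><BODY><P>hi</P></BODY></HTML>";
        System.out.println("stripped:   " + countMatches(xml, xhtml, "//BODY"));
        System.out.println("unstripped: " + countMatches(xml, null, "//BODY"));
    }
}
```

Passing the XHTML namespace to the handler mirrors what the ParserBolt is described as doing; passing null keeps the namespaces and reproduces the original mismatch. Stripping at build time keeps the XPath expressions identical for both parsers, at the cost of losing the namespace information in the resulting DOM.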