[NUTCH-1918] TikaParser specifies a default namespace when generating DOM - ASF JIRA


Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.10
Component/s: parser
Labels: None
Patch Info: Patch Available

Description

The DOM generated by parse-tika differs from the one generated by parse-html. Ideally we should be able to use either parser with the same XPath expressions.
This is related to NUTCH-1592, but this time the problem comes from the namespace used rather than from the letter case of element names.
This issue has been investigated and fixed in storm-crawler: https://github.com/DigitalPebble/storm-crawler/pull/58

Here is what Guillaume explained there:

When parsing the content, Tika creates a properly formatted XHTML document: all elements are created within the namespace XHTML.

However, XPath 1.0 has no concept of a default namespace, so XPath expressions such as //BODY don't match anything. To make this work we would have to use //ns1:BODY and define a NamespaceContext which associates ns1 with "http://www.w3.org/1999/xhtml".
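The mismatch can be reproduced with the plain JDK XPath API. The following is a minimal sketch, not the Nutch or storm-crawler code: the class name `XhtmlNamespaceDemo`, the sample document, and the helper methods are assumptions made for illustration, and the upper-case element names simply mirror the //BODY expression above.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlNamespaceDemo {
    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample: every element lives in the XHTML namespace.
    public static final String SAMPLE =
        "<HTML xmlns=\"" + XHTML_NS + "\"><BODY><P>hi</P></BODY></HTML>";

    // Parse with namespace awareness, so elements keep their namespace URI.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Count the nodes matched by expr, optionally binding ns1 to XHTML.
    public static int count(Document doc, String expr, boolean bindNs1)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        if (bindNs1) {
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "ns1".equals(prefix) ? XHTML_NS
                                                : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "ns1" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    String p = getPrefix(uri);
                    return (p == null ? Collections.<String>emptyList()
                                      : Collections.singletonList(p)).iterator();
                }
            });
        }
        NodeList nodes =
            (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(count(doc, "//BODY", false));    // prints 0
        System.out.println(count(doc, "//ns1:BODY", true)); // prints 1
    }
}
```

The unprefixed expression only matches elements that have no namespace, which is why every expression would need the ns1 prefix plus a NamespaceContext unless the DOM itself is built without the namespace.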

To keep the XPath expressions simple, I modified DOMBuilder, the SAX handler we use to convert SAX events into a DOM tree, so that it ignores a configurable "default namespace". ParserBolt initializes it with the XHTML namespace, so //BODY matches.
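The idea of a SAX handler that drops one designated namespace while building the DOM can be sketched as follows. This is an illustrative stand-in, not the actual DOMBuilder from Nutch or storm-crawler: the class `DefaultNamespaceIgnoringBuilder`, its constructor argument, and the `build` and `count` helpers are all hypothetical, and attribute namespaces are dropped for brevity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DefaultNamespaceIgnoringBuilder extends DefaultHandler {
    private final String ignoredNs; // namespace URI to strip from elements
    private final Document doc;
    private Node current;           // node that receives the next child

    public DefaultNamespaceIgnoringBuilder(String ignoredNs) throws Exception {
        this.ignoredNs = ignoredNs;
        this.doc = DocumentBuilderFactory.newInstance()
                                         .newDocumentBuilder().newDocument();
        this.current = doc;
    }

    public Document getDocument() {
        return doc;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Elements in the ignored namespace are created with no namespace.
        boolean strip = uri == null || uri.isEmpty() || uri.equals(ignoredNs);
        Element e = strip
            ? doc.createElementNS(null, localName)
            : doc.createElementNS(uri, qName.isEmpty() ? localName : qName);
        // Simplification: attributes are copied by local name only.
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getLocalName(i), atts.getValue(i));
        }
        current.appendChild(e);
        current = e;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        current = current.getParentNode();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != doc) { // text cannot be a direct child of the Document
            current.appendChild(
                doc.createTextNode(new String(ch, start, length)));
        }
    }

    // Feed namespace-aware SAX events from an XML string into the builder.
    public static Document build(String xml, String ignoredNs) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        DefaultNamespaceIgnoringBuilder handler =
            new DefaultNamespaceIgnoringBuilder(ignoredNs);
        factory.newSAXParser()
               .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getDocument();
    }

    // Count nodes matched by a plain XPath expression, no NamespaceContext.
    public static int count(Document doc, String expr) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }
}
```

With a document whose root declares xmlns="http://www.w3.org/1999/xhtml", calling `build(xml, "http://www.w3.org/1999/xhtml")` yields a tree in which an unprefixed expression such as //BODY matches without any NamespaceContext.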

Attachments


NUTCH-1918.patch (2 kB), attached 15/Jan/15 11:04 by Julien Nioche

People

Assignee: Unassigned

Reporter: Julien Nioche

Votes: 0

Watchers: 4

Dates

Created: 15/Jan/15 11:01

Updated: 13/Mar/24 14:51

Resolved: 30/Jan/15 09:06