Details
Description
The aim of this plugin is to use XSLT to extract metadata from HTML DOM structures.
Your Data --> Parse-html plugin or TIKA plugin --> DOM structure --> XSLT plugin
The main advantages are:
- You don't have to write any Java code, only XSLT and configuration
- It can process the DOM structure from a DocumentFragment (see NekoHtml and TagSoup)
- It is compatible with the HtmlParseFilter extension point and can be plugged in like any other plugin (parse-js, parse-swf, etc.)
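The idea of driving metadata extraction from a stylesheet can be sketched with the JDK's built-in XSLT support. This is only an illustration, not the plugin's actual code: the class name `XsltMetadataSketch`, the `extract` method, the stylesheet, and the `<fields>/<field>` output format are all hypothetical. It also parses a well-formed XHTML string into a plain `Document` to stay self-contained, whereas inside Nutch the DOM would come from the HTML or Tika parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XsltMetadataSketch {

    // Hypothetical stylesheet: emits one <field name="..."> element per metadata value.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml'/>"
      + "<xsl:template match='/'>"
      + "<fields>"
      + "<field name='title'><xsl:value-of select='normalize-space(//title)'/></field>"
      + "<field name='description'>"
      + "<xsl:value-of select=\"//meta[@name='description']/@content\"/>"
      + "</field>"
      + "</fields>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to a DOM built from the input and collects the fields. */
    public static Map<String, String> extract(String xhtml) throws Exception {
        // Stand-in for the DOM that a parse plugin would normally provide.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document dom = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));

        // Compile the stylesheet and run it against the DOM.
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        DOMResult result = new DOMResult();
        transformer.transform(new DOMSource(dom), result);

        // Read the <field> elements of the result tree into a name -> value map.
        Map<String, String> metadata = new LinkedHashMap<>();
        NodeList fields = ((Document) result.getNode()).getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            metadata.put(field.getAttribute("name"), field.getTextContent());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title> Hello World </title>"
            + "<meta name='description' content='A test page'/></head><body/></html>";
        System.out.println(extract(page));
    }
}
```

With this approach, adding or changing an extracted field means editing the stylesheet rather than the Java code, which is the advantage the description points to.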
This topic has been discussed on the dev@nutch.apache.org mailing list: http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html
Attachments
Issue Links
- is duplicated by NUTCH-1871 Generic xsl parser plugin (Closed)
- is related to NUTCH-1644 Should have a parser that uses xpath (Closed)
- links to