[NUTCH-1871] Generic xsl parser plugin - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.9
Fix Version/s: 1.9
Component/s: indexer, parser
Labels: None
Patch Info: Patch Available

Description

The aim of this plugin is to use XSLT to extract metadata from HTML DOM structures.

Your Data --> Parse-html plugin or TIKA plugin --> DOM structure --> XSLT plugin

The main advantages are:

- You won't have to write any Java code, only XSLT and configuration.
- It can process a DOM structure from a DocumentFragment (@see NekoHtml and @see TagSoup).
- It is HtmlParseFilter compatible and can be plugged in like any other plugin (parse-js, parse-swf, etc.).
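
To illustrate the idea of extracting metadata with XSLT alone, here is a minimal sketch using only the JDK's javax.xml.transform API. It is not the code from the attached patch: the class name XsltMetadataSketch, the stylesheet TITLE_XSLT and the helper sampleFragment are hypothetical, and the hand-built fragment stands in for the DocumentFragment that parse-html or Tika would supply. The sketch copies the fragment's first element into a standalone Document before transforming, on the assumption that some XSLT processors do not accept a DocumentFragment directly as a source.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class XsltMetadataSketch {

    // Hypothetical stylesheet: outputs the whitespace-normalized <title> as plain text.
    static final String TITLE_XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:value-of select=\"normalize-space(//title)\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Copies the first element child of the fragment into a new Document.
    // Assumption: the fragment holds a single root element such as <html>.
    static Document toDocument(DocumentFragment fragment) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        for (Node n = fragment.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                doc.appendChild(doc.importNode(n, true));
                return doc;
            }
        }
        throw new IllegalArgumentException("fragment contains no element");
    }

    // Applies the given stylesheet to the fragment and returns the text output.
    static String transform(DocumentFragment fragment, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(toDocument(fragment)), new StreamResult(out));
        return out.toString();
    }

    // Builds a tiny stand-in for the DOM an HTML parser would produce.
    static DocumentFragment sampleFragment() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element title = doc.createElement("title");
        title.setTextContent("  Hello   Nutch ");
        head.appendChild(title);
        html.appendChild(head);
        fragment.appendChild(html);
        return fragment;
    }
}
```

Calling XsltMetadataSketch.transform(XsltMetadataSketch.sampleFragment(), XsltMetadataSketch.TITLE_XSLT) returns the title text of the sample fragment. Extracting a different field would only require a different stylesheet, which is the point the list above makes about avoiding new Java code.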

This topic has been discussed on the dev mailing list: http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html

Attachments

xsl-parse-plugin.patch, 273 kB, 05/Oct/14 20:01, Albinscode

Issue Links

duplicates: NUTCH-1870 Generic xsl parser plugin (Open)

People

Assignee: Unassigned

Reporter: Albinscode

Votes: 0

Watchers: 2

Dates

Created: 05/Oct/14 19:38

Updated: 05/Oct/14 20:09

Resolved: 05/Oct/14 20:09