[NUTCH-1870] Generic xsl parser plugin - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.9
Fix Version/s: None
Component/s: indexer, parser
Labels: None
Patch Info: Patch Available

Description

The aim of this plugin is to use XSLT to extract metadata from HTML DOM structures.

Your data --> parse-html plugin or Tika plugin --> DOM structure --> XSLT plugin

The main advantages are:

- You don't have to write any Java code, only XSLT and configuration.
- It can process the DOM structure from a DocumentFragment (see NekoHTML and TagSoup).
- It is compatible with HtmlParseFilter plugins and can be plugged in like any other plugin (parse-js, parse-swf, etc.).
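To illustrate the idea (this is not the code from the attached patches), here is a minimal sketch of applying an XSLT stylesheet to a DocumentFragment and collecting the result as metadata, using only the JDK's javax.xml.transform API. The class name `XsltMetadataSketch`, the `<fields>`/`<field name="...">` output layout, and the sample stylesheet and fragment are all assumptions made up for this example. The fragment is copied under a wrapper element in a fresh Document first, since a transformer may not accept a bare DocumentFragment as its source.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: NOT the plugin from the attached patches.
final class XsltMetadataSketch {

  // Hypothetical stylesheet: emits one <field name="..."> per extracted value.
  // Element name case in real parser output may differ (depends on parser settings).
  static final String SAMPLE_XSLT =
      "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
    + "<xsl:template match='/'>"
    + "<fields>"
    + "<field name='title'><xsl:value-of select='normalize-space(//h1)'/></field>"
    + "<field name='author'><xsl:value-of select=\"//span[@class='author']\"/></field>"
    + "</fields>"
    + "</xsl:template>"
    + "</xsl:stylesheet>";

  static Document newDocument() throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    return dbf.newDocumentBuilder().newDocument();
  }

  // Applies the stylesheet to the fragment and returns field name -> value.
  static Map<String, String> extract(DocumentFragment fragment, String xslt) throws Exception {
    // Copy the fragment's children under a wrapper root in a new Document.
    Document input = newDocument();
    Element root = input.createElementNS(null, "root");
    input.appendChild(root);
    root.appendChild(input.importNode(fragment, true));

    Transformer transformer = TransformerFactory.newInstance()
        .newTransformer(new StreamSource(new StringReader(xslt)));
    Document output = newDocument();
    transformer.transform(new DOMSource(input), new DOMResult(output));

    // Read the <field> elements produced by the stylesheet.
    Map<String, String> metadata = new LinkedHashMap<>();
    NodeList fields = output.getElementsByTagName("field");
    for (int i = 0; i < fields.getLength(); i++) {
      Element field = (Element) fields.item(i);
      metadata.put(field.getAttribute("name"), field.getTextContent());
    }
    return metadata;
  }

  // Stand-in for the DocumentFragment an HTML parser would hand over.
  static DocumentFragment sampleFragment() throws Exception {
    Document doc = newDocument();
    DocumentFragment fragment = doc.createDocumentFragment();
    Element h1 = doc.createElementNS(null, "h1");
    h1.setTextContent("  Sample title ");
    Element span = doc.createElementNS(null, "span");
    span.setAttributeNS(null, "class", "author");
    span.setTextContent("Sample author");
    fragment.appendChild(h1);
    fragment.appendChild(span);
    return fragment;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(extract(sampleFragment(), SAMPLE_XSLT));
  }
}
```

In this sketch, changing which metadata is extracted only requires editing the stylesheet, which is the point of the first advantage above.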

This topic was discussed on the Nutch dev mailing list: http://www.mail-archive.com/dev%40nutch.apache.org/msg15257.html

Attachments

Name | Date | Size | Uploaded by
NUTCH-1870-trunk-v3.patch | 10/Nov/14 21:40 | 50 kB | Sebastian Nagel
NUTCH-1870-trunk-v4.patch | 25/Feb/15 21:55 | 55 kB | Sebastian Nagel
nutch-site.xml | 04/Nov/14 19:55 | 1 kB | Albinscode
xsl-parse-plugin.patch | 05/Oct/14 20:03 | 273 kB | Albinscode
xsl-parse-plugin2.patch | 17/Oct/14 21:10 | 68 kB | Albinscode

Issue Links

is duplicated by: NUTCH-1871 Generic xsl parser plugin (Closed)
is related to: NUTCH-1644 Should have a parser that uses xpath (Closed)
links to: GitHub Pull Request #439

People

Assignee: Unassigned

Reporter: Albinscode

Votes: 0

Watchers: 6

Dates

Created: 05/Oct/14 19:36

Updated: 14/Jul/20 16:49