[NUTCH-887] Delegate parsing of feeds to Tika - ASF JIRA

Details

Type: Wish
Status: Closed
Priority: Major
Resolution: Auto Closed
Affects Version/s: nutchgora
Fix Version/s: 2.5
Component/s: parser
Labels: None

Description

[Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]

One of the plugins that hasn't been ported yet is the feed parser. We could rely on the one we recently added to Tika, bearing in mind one substantial difference: the Tika feed parser generates a simple XHTML representation of the document in which the feed entries are simply represented as anchors, whereas the Nutch version creates a new document for each entry.
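To make that difference concrete, here is a minimal sketch of the two output shapes. It is not Tika or Nutch code: it uses only the Python standard library, assumes a plain RSS 2.0 input, and the sample feed and the names `read_items`, `as_single_xhtml` and `as_documents` are hypothetical, chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical two-entry RSS 2.0 feed used only for this illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Hello</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <description>World</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return (feed_title, [(title, link, description), ...]) for an RSS 2.0 string."""
    channel = ET.fromstring(rss_text).find("channel")
    items = [
        (
            item.findtext("title", ""),
            item.findtext("link", ""),
            item.findtext("description", ""),
        )
        for item in channel.findall("item")
    ]
    return channel.findtext("title", ""), items


def as_single_xhtml(rss_text):
    """Anchor style: one XHTML document, each entry reduced to a link."""
    feed_title, items = read_items(rss_text)
    anchors = "".join(
        "<li><a href=%s>%s</a></li>" % (quoteattr(link), escape(title))
        for title, link, _ in items
    )
    return (
        "<html><head><title>%s</title></head><body><ul>%s</ul></body></html>"
        % (escape(feed_title), anchors)
    )


def as_documents(rss_text):
    """Per-entry style: one independent record per entry, keyed by the entry URL."""
    _, items = read_items(rss_text)
    return {
        link: {"title": title, "text": description}
        for title, link, description in items
    }
```

With the anchor style, a crawler only sees the entries as outlinks of the feed page and has to fetch each one to index it. With the per-entry style, each entry is indexable straight from the feed, using whatever text the feed itself carries.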

There is also the parse-rss plugin in Nutch, which is quite similar. How does it differ from the feed plugin again? Since the Tika parser would handle all sorts of feed formats, why not simply rely on it?

Any thoughts on this?

People

Assignee: Unassigned

Reporter: Julien Nioche

Votes: 0

Watchers: 1

Dates

Created: 14/Aug/10 08:30

Updated: 13/Oct/19 22:36

Resolved: 13/Oct/19 22:36