Nutch
  1. Nutch
  2. NUTCH-887

Delegate parsing of feeds to Tika

    Details

    • Type: Wish Wish
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: nutchgora
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels:
      None

      Description

      [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]

      One of the plugins which hasn't been ported yet is the feed parser. We could rely on the one we recently added to Tika, knowing that there is a substantial difference in the sense that the Tika feed parser generates a simple XHTML representation of the document where the feeds are simply represented as anchors whereas the Nutch version created new documents for each feed.

      There is also the parse-rss plugin in Nutch which is quite similar - what's the difference with the feed one again? Since the Tika parser would handle all sorts of feed formats why not simply rely on it?

      Any thoughts on this?

        Activity

        No work has yet been logged on this issue.

          People

          • Assignee:
            Unassigned
            Reporter:
            Julien Nioche
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development