Nutch / NUTCH-887

Delegate parsing of feeds to Tika

    Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: nutchgora
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels: None

      Description

      [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]

      One of the plugins that hasn't been ported yet is the feed parser. We could rely on the one we recently added to Tika, with one substantial difference to keep in mind: the Tika feed parser generates a simple XHTML representation of the document in which the feed entries are represented as anchors, whereas the Nutch version creates a new document for each feed entry.
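
      To make the difference concrete, here is a minimal sketch of the two output shapes in plain Java. It uses neither the Tika nor the Nutch API: the class `FeedRepresentationSketch`, the `Entry` type, and both methods are hypothetical names made up for illustration, and the XHTML layout is an assumption rather than Tika's actual output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FeedRepresentationSketch {

    /** Hypothetical minimal feed entry: just a title and a link. */
    public static final class Entry {
        public final String title;
        public final String link;

        public Entry(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Escape the characters that are special in XHTML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Single-document shape: one XHTML page, each entry reduced to an anchor. */
    public static String toSingleXhtml(String feedTitle, List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><head><title>").append(escape(feedTitle))
          .append("</title></head><body><ul>");
        for (Entry e : entries) {
            sb.append("<li><a href=\"").append(escape(e.link)).append("\">")
              .append(escape(e.title)).append("</a></li>");
        }
        sb.append("</ul></body></html>");
        return sb.toString();
    }

    /** Per-entry shape: one separate document per entry, keyed by the entry link. */
    public static Map<String, String> toDocumentPerEntry(List<Entry> entries) {
        Map<String, String> docs = new LinkedHashMap<>();
        for (Entry e : entries) {
            // Each entry becomes its own (here: trivially small) document.
            docs.put(e.link, e.title);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("First post", "http://example.org/1"),
                new Entry("Second post", "http://example.org/2"));

        System.out.println(toSingleXhtml("Example feed", entries));
        System.out.println(toDocumentPerEntry(entries).size() + " separate documents");
    }
}
```

      With the single-document shape, the entries only reach the crawler as outlinks of the feed page; with the per-entry shape, each entry is indexed as a document in its own right.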

      There is also the parse-rss plugin in Nutch, which is quite similar. How does it differ from the feed plugin again? Since the Tika parser would handle all sorts of feed formats, why not simply rely on it?

      Any thoughts on this?

          People

          • Assignee: Unassigned
          • Reporter: Julien Nioche
          • Votes: 0
          • Watchers: 0