Nutch / NUTCH-887

Delegate parsing of feeds to Tika

    Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: nutchgora
    • Fix Version/s: 2.4
    • Component/s: parser
    • Labels:
      None

      Description

      [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]

      One of the plugins that hasn't been ported yet is the feed parser. We could rely on the one we recently added to Tika, but there is a substantial difference: the Tika feed parser generates a simple XHTML representation of the document in which the feed entries are simply represented as anchors, whereas the Nutch version creates a new document for each feed entry.

      There is also the parse-rss plugin in Nutch which is quite similar - what's the difference with the feed one again? Since the Tika parser would handle all sorts of feed formats why not simply rely on it?

      Any thoughts on this?
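
      To make the difference concrete, here is a minimal sketch of the two strategies using only the JDK's XML APIs. It is not the Tika feed parser or the Nutch feed plugin: the class name `FeedStrategiesSketch`, its methods, and the RSS 2.0-only handling are assumptions made purely for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; neither Tika nor Nutch code.
class FeedStrategiesSketch {

    /** One feed entry: its title and the URL it links to. */
    static final class Entry {
        final String title;
        final String link;
        Entry(String title, String link) { this.title = title; this.link = link; }
    }

    /** Extracts the <item> elements of an RSS 2.0 feed (Atom is not handled here). */
    static List<Entry> parseEntries(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted feeds cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            NodeList items = builder.parse(new InputSource(new StringReader(rssXml)))
                    .getElementsByTagName("item");
            List<Entry> entries = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                entries.add(new Entry(childText(item, "title"), childText(item, "link")));
            }
            return entries;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    /** Strategy A: one document for the whole feed, entries rendered as anchors. */
    static String toSingleXhtml(List<Entry> entries) {
        StringBuilder sb = new StringBuilder("<html><body>\n");
        for (Entry e : entries) {
            sb.append("<a href=\"").append(escape(e.link)).append("\">")
              .append(escape(e.title)).append("</a>\n");
        }
        return sb.append("</body></html>").toString();
    }

    /** Strategy B: one sub-document per entry, keyed by the entry's URL. */
    static Map<String, String> toSubDocuments(List<Entry> entries) {
        Map<String, String> docs = new LinkedHashMap<>();
        for (Entry e : entries) {
            docs.put(e.link, e.title);
        }
        return docs;
    }

    /** Returns the trimmed text of the first descendant with the given tag, or "". */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Escapes the characters that are unsafe in XHTML text and attribute values. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>Sample</title>"
                + "<item><title>First</title><link>http://example.com/1</link></item>"
                + "<item><title>Second</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        List<Entry> entries = parseEntries(rss);
        System.out.println(toSingleXhtml(entries));
        System.out.println(toSubDocuments(entries));
    }
}
```

      With strategy A the crawler sees a single page whose outlinks are the entries; with strategy B each entry becomes its own record, which is the compound-document behaviour the existing Nutch plugin relies on.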

        Activity

        Field: Original Value -> New Value

        Lewis John McGibbney made changes -
        Fix Version/s: 2.3 [ 12324325 ] -> 2.4 [ 12324540 ]
        Lewis John McGibbney made changes -
        Fix Version/s: 2.2 [ 12323285 ] -> 2.3 [ 12324325 ]
        Lewis John McGibbney made changes -
        Fix Version/s: 2.1 [ 12321040 ] -> 2.2 [ 12323285 ]
        Lewis John McGibbney made changes -
        Fix Version/s: nutchgora [ 12314893 ] -> 2.1 [ 12321040 ]
        Lewis John McGibbney added a comment -

        Set and Classify

        Julien Nioche added a comment -

        This issue is about parse-feeds and it requires some changes to the way Nutch 2.0 works (compound docs - see comments above). Let's leave it open for now.

        Markus Jelsma added a comment -

        Julien committed NUTCH-888 for 1.3 and trunk. I guess this issue can be closed?

        Julien Nioche added a comment -

        Have created https://issues.apache.org/jira/browse/NUTCH-888 and will remove parse-rss tomorrow.

        Chris A. Mattmann added a comment -

        > Ah, good - I missed that, I need to take a closer look at this...

        Np, let me know what you think. If it needs improvement, I'll be happy to pick up a shovel and help out.

        > The "creep" so far is just parse-html, which we were forced to add back because Tika HTML parsing was totally inadequate to our needs. I know there has been some progress on this front, but I suspect it's still not sufficient. The ultimate goal is still to use Tika for all formats that it can handle, preferably "all formats" without further qualifiers.

        Coo coo, thanks Andrzej!

        Cheers,
        Chris

        Andrzej Bialecki added a comment -

        > Huh, what do you mean? Nick just added a bunch of code to handle Compound document detection, and parsing

        Ah, good - I missed that, I need to take a closer look at this...

        > I'm starting to feel the creep of parsing plugins making their way back into Nutch instead of just jumping over into Tika

        The "creep" so far is just parse-html, which we were forced to add back because Tika HTML parsing was totally inadequate to our needs. I know there has been some progress on this front, but I suspect it's still not sufficient. The ultimate goal is still to use Tika for all formats that it can handle, preferably "all formats" without further qualifiers.

        Chris A. Mattmann added a comment -

        > There is something missing in Tika, and it's the support for compound documents, but it's not likely to be added in 0.8

        Huh, what do you mean? Nick just added a bunch of code to handle Compound document detection, and parsing, see TIKA-447 and the discussions on the wiki here: http://wiki.apache.org/tika/MetadataDiscussion. It may not be complete yet, but neither is 0.8.

        > I'd keep the "feed" plugin around for a while still, as an interim solution until Tika supports compound documents. +1 to getting rid of parse-rss.

        +1, I agree, but I still believe our goal should be to delegate this to Tika. I'm starting to feel the creep of parsing plugins making their way back into Nutch instead of just jumping over into Tika and working the process over there. In the end, if we start to add back all the parsing plugins, I'm not sure we've accomplished our goal...

        Andrzej Bialecki added a comment -

        > If there's something missing that Nutch needs, we'll add it to Tika and roll it into 0.8.

        There is something missing in Tika, and it's the support for compound documents, but it's not likely to be added in 0.8... not that we have such support in Nutch at the moment - it fell victim to the trunk/nutchbase switch, but it should be added back soon. I'd keep the "feed" plugin around for a while still, as an interim solution until Tika supports compound documents. +1 to getting rid of parse-rss.

        Chris A. Mattmann added a comment -

        Hey Julien:

        +1 to relying on Tika for RSS parsing. If there's something missing that Nutch needs, we'll add it to Tika and roll it into 0.8.

        > There is also the parse-rss plugin in Nutch which is quite similar - what's the difference with the feed one again? Since the Tika parser would handle all sorts of feed formats why not simply rely on it?

        I wrote parse-rss back in 2005, and used commons-feedparser from Kevin Burton and his crew. At the time it was well developed, and a little more flexible and easier for me to pick up than Rome. Since then, however, its development has stagnated and it is no longer maintained.

        In terms of functionality, they are roughly equivalent, so there isn't much real difference. I would suggest we move forward with the feed plugin in Tika and roll it back in through Nutch.

        Julien Nioche created issue -

          People

          • Assignee: Unassigned
          • Reporter: Julien Nioche
          • Votes: 0
          • Watchers: 0
