Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: fetcher
    • Labels:
      None

      Description

      A simple RSS feed parser supporting RSS and Atom:

      + version 0.3
      + version 0.9
      + version 1.0
      + version 2.0

      Conversion between the different RSS versions is done via XSLT.
      The xslt was contributed by Frank Henze - Thanks!

      1. parseRss.zip
        1.91 MB
        Hasan Diwan
      2. parse-rss.zip
        1.92 MB
        Chris A. Mattmann
      3. parse-rss-1.0-040605.zip
        1.93 MB
        Chris A. Mattmann
      4. parse-rss-73005.zip
        1.96 MB
        Chris A. Mattmann
      5. parse-rss-patch.txt
        109 kB
        Chris A. Mattmann
      6. parse-rss-srcbin-incl-path.zip
        1.96 MB
        Chris A. Mattmann
      7. RSS_Parser.zip
        47 kB
        Stefan Groschupf
      8. RSSParserPatch.txt
        15 kB
        Stefan Groschupf

        Activity

        Stefan Groschupf added a comment -

        + rss parser patch against latest sources
        + binary and source dist are in the zip

        Chris A. Mattmann added a comment -

        An RSS Parser plugin based on the Apache Commons-Feedparser RSS Parsing library. From the feedparser site (http://jakarta.apache.org/commons/sandbox/feedparser/):

        Jakarta FeedParser is a Java RSS/Atom parser designed to elegantly support all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability.

        FeedParser was the parser API designed by Kevin Burton for NewsMonster and has been donated to the ASF in order to continue development.

        FeedParser differs from most other RSS/Atom parsers in that it is not DOM based but event based (similar to SAX). Instead of the low level startElement() API present in SAX, we provide higher level events based on feed parsing information.

        Events are also given to the caller independent of the underlying format. This is accomplished with a Feed Event Model that isolates your application from the underlying feed format. This enables transparent support for all RSS versions including Atom. We also hide format specific implementation such as dates (RFC 822 in RSS 2.0 and 0.9x and ISO 8601 in RSS 1.0 and Atom) and other metadata.

        The FeedParser distribution also includes:

        1. An implementation of RSS and Atom autodiscovery.
        2. Support for all content modules including xhtml:body, mod_content (RDF and inline), atom:content, and atom:summary
        3. Atom 1.0 link API as well as RSS 1.0 mod_link API
        4. An HTML link parser for finding all links in an HTML source file and expanding them to become full URLs instead of relative.

        I've included the zipped up parse-rss nutch plugin, along with a patch file generated from svn diff. Hopefully you guys will find the parser useful, and please vote for it if you would like to include it in the Nutch source tree.

        Cheers,
        Chris

        Chris A. Mattmann added a comment -

        Hi Folks,

        One more comment on the parse-rss plugin that I've just attached: I also included a JUnit test modeled on John X's JUnit test for the PDF parser. The RSS JUnit test parses a sample RSS file and makes sure it reads the correct number of outlinks, and the correct outlinks, from the RSS file.

        Thanks,
        Chris

        Chris A. Mattmann added a comment -

        One more comment: the files that I submitted were:

        parse-rss-patch.txt
        parse-rss.zip

        The other two files are the files needed for Stefan's submitted rss parser plugin.

        Thanks,
        Chris

        Andrzej Bialecki added a comment -

        A couple of comments:

        • there was a recent thread on the list discussing a motion to reduce dependencies on external XML-related libraries and to rely on the JDK-supplied XML parser instead. I noticed that your plugin uses dom4j. I'm not sure how easy/feasible/sensible it would be to get rid of that dependency...?
        • the parse-rss-patch.txt patch introduces huge white-space diffs. This is confusing, and they should be removed before the patch is applied.
        • the method transformDocument creates a new Transformer for every call. Transformers are not thread-safe, so we cannot share a single instance, but each transformation can be a CPU- and time-intensive process. Perhaps this should be converted to use a pool of pre-allocated Transformers, as many as the fetcher.threads configuration variable specifies.
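
The pooling idea in the last point can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class name TransformerPool and its methods are hypothetical, and in the real plugin the pool size would presumably be read from fetcher.threads. The sketch compiles the stylesheet once into a Templates object (which is thread-safe) and hands out pre-allocated Transformer instances through a blocking queue.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical fixed-size pool of pre-allocated Transformers (illustration only). */
final class TransformerPool {
    private final BlockingQueue<Transformer> pool;

    TransformerPool(String stylesheet, int size) throws TransformerException {
        // Compile the stylesheet once; a Templates object may be shared across threads.
        Templates templates = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(stylesheet)));
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(templates.newTransformer());
        }
    }

    /** Borrows a Transformer, runs the transformation, and returns it to the pool. */
    String transform(String xml) throws TransformerException, InterruptedException {
        Transformer transformer = pool.take(); // blocks until an instance is free
        try {
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } finally {
            pool.offer(transformer); // never fails: the queue holds at most `size` items
        }
    }
}
```

Each fetcher thread would call transform(...) and wait only when all pooled instances are busy, so no Transformer is ever used by two threads at once.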
        Kevin Burton added a comment -

        I'd recommend against using an XSLT approach to your RSS issue. Use the FeedParser. There are a score of issues that an XSLT approach won't fix.

        WRT external XML libraries we're using Jaxen and JDOM right now. FeedParser 2.0 won't have any external libraries and will only use SAX internally.

        I'd just bite the bullet and take the dependencies because IMO you won't find another parser that comes anywhere close to solving all the issues that the FeedParser will. Rome comes close, but then you'd have the same number of dependencies.

        Kevin

        Chris A. Mattmann added a comment -

        Hi Folks,

        Okay, I've addressed the comments put forth by Andrzej Bialecki regarding the parse-rss plugin that I submitted, and the latest version of it is attached. The zip file contains the plugin's directory, which is self-contained. I haven't attached the svn diff/patch file with this, but here is what would need to be done to include this in the source tree:

        1. enable/add the plugin in nutch-default.xml (the name of the plugin is parse-rss)
        2. add the plugin parse-rss in src/plugin/build.xml in the "deploy", "test" and "clean" targets
        3. edit conf/mime.types to add the following two lines in the file:

        application/rss+xml rss
        text/xml rss

        4. That's it.

        I can attach an SVN diff against the latest Nutch snapshot later tonight. For now, note that this attachment includes the following updates:

        > * package names follow the old naming, the new naming is under
        > org.apache.nutch.*

        Fixed, tested, and included in the zip file
        >
        > * in RSSParser.java, you retrieve contentLength, but the code never
        > uses it.

        Took this part out: included in the zip file.

        >
        > * lines 149-160 seem a bit bogus to me. As I understand the RSS spec,
        > the item's permalink should be preferred if present, but it's not an
        > error if it's absent (as signified by a null value, which currently
        > causes MalformedURLException to be thrown) - in such case the getLink
        > should be used instead. The message in line 157 is wrong, too, because
        > it prints the url of the channel, and not the current item. When it's
        > fixed, it would be also good to demonstrate such fallback in the test
        > case.

        Fixed this part: I now check for the permalink first; if it's null, I try the regular "link" field; and if there's nothing there either, I make it choke (i.e. throw MalformedURLException). I hope it's a bit more robust.
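
The fallback order described above might look roughly like this. It is a sketch under assumptions rather than the plugin's actual code: ItemLinkResolver and resolve are hypothetical names, the two string arguments stand in for whatever the feed library returns for an item's permalink and link, and treating an empty string like null is an extra assumption not stated in the thread.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper showing the permalink-then-link fallback (illustration only). */
final class ItemLinkResolver {
    static URL resolve(String permalink, String link) throws MalformedURLException {
        // Prefer the item's permalink when the feed provides one.
        if (permalink != null && !permalink.isEmpty()) {
            return new URL(permalink);
        }
        // Otherwise fall back to the regular link field.
        if (link != null && !link.isEmpty()) {
            return new URL(link);
        }
        // Neither is present: report the problem for this item.
        throw new MalformedURLException("feed item has neither a permalink nor a link");
    }
}
```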

        >
        > * I'm not sure what is the purpose of copying through the metadata -
        > the code doesn't modify the copy, so you could as well use the
        > original, right?

        Took this out.

        >
        > * probably just a matter of programming style, but I'm always somewhat
        > vexed by frequent String concatenations, especially in a "for" loop -
        > like in the code that creates the title and the body. StringBuffer-s
        > would be a good fit here...

        It now uses StringBuffer.append instead of string concatenation.
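
For illustration, building a title inside a loop with StringBuffer.append could look like the following. TitleBuilder, joinTitles, and the ", " separator are hypothetical and are not taken from RSSParser.java.

```java
import java.util.List;

/** Hypothetical helper: accumulate item titles without repeated String concatenation. */
final class TitleBuilder {
    static String joinTitles(List<String> titles) {
        StringBuffer buffer = new StringBuffer();
        for (String title : titles) {
            if (buffer.length() > 0) {
                buffer.append(", "); // separator between titles (an assumption)
            }
            buffer.append(title);
        }
        return buffer.toString();
    }
}
```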

        >
        > * IMHO it's better to put the various intermediate diagnostic output
        > from the plugin under LOG.fine(), to reduce the amount of information
        > to be logged. The final result of processing the content could be put
        > under
        > LOG.info() or LOG.warn(), depending on the final result. (I personally
        > favor no output whatsoever if everything went ok).

        Yup, I agree. I've moved all non-essential (i.e. non-exception) logging to LOG.fine; this is included in the zip file.
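
A small java.util.logging sketch of that convention follows. The class ParseLogging, the logger name, and the parse method are hypothetical; only the split between fine-level diagnostics and a warning on failure comes from the discussion above.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical example: diagnostics at FINE, failures at WARNING, silence on success. */
final class ParseLogging {
    static final Logger LOG = Logger.getLogger("parse-rss-sketch");

    static boolean parse(String url, int outlinks, Exception failure) {
        // Intermediate diagnostics: only visible when the logger is set to FINE or lower.
        LOG.fine("parsing " + url);
        LOG.fine("found " + outlinks + " outlinks");
        if (failure != null) {
            // The final outcome is logged only when something went wrong.
            LOG.log(Level.WARNING, "failed to parse " + url, failure);
            return false;
        }
        return true;
    }
}
```

At the default INFO level, a successful parse produces no output at all, which matches the preference stated in the quoted review.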

        >
        > And lastly, a minor thing, but still... the formatting style and
        > indentation in most files doesn't adhere to the Nutch coding style
        > when it comes to whitespace rules - please see e.g. WebDBReader.java
        > as a reference. This especially concerns the whitespace around curly
        > braces and assignments, and the use of literal Tab instead of 4
        > spaces. This is easy to fix with an IDE, but it helps a lot when
        > someone else is reading the code...

        Fixed.

        John Xing added a comment -

        Could we have an updated patch & zip against most recent svn?
        Also I am not sure it is a good idea to have parse-rss capture
        any mime type text/xml. Maybe more specific magic entries should
        be introduced in ./conf/mime-types.xml for rss?

        John

        Chris A. Mattmann added a comment -

        Hi John,

        Here ya go. The zip file includes:

        1. up-to-date zipped up src of the plugin, incl. required binary jars in lib directory, tested against the latest SVN of nutch
        2. text output of running the unit tests to test the plugin
        3. patch file against the latest SVN source.

        You should be good to go with this. Let me know if there are any troubles.

        BTW, I changed the content type in the plugin.xml to be "application/rss+xml" as opposed to "text/xml", as it was before. I'm sure we'll need to think more about what the most appropriate setting for this is, but for now, it should be fine (as it can always be tailored to the user's env by changing the attribute).

        Take care and thanks!

        Cheers,
        Chris

        Hasan Diwan added a comment -

        I've fixed this plugin for the apache incubator. It now compiles. There are no further changes. I'm attaching the new zip.

        Michael Nebel added a comment -

        I loaded the latest sources from svn yesterday and tried to integrate this plugin (I used the zip from Hasan). I found:

        • the plugin's getParse throws a ParseException, which the current getParse signature doesn't allow
        • the call to new ParseData needs a new parameter "ParseStatus"

        My fixes are far from perfect (so far I have only identified the problems), so I'm not creating a patch.

        Chris A. Mattmann added a comment -

        An updated patch and source distribution for the parse-rss plugin. The latest patch and source work against the new protocol and parsing APIs by Andrzej. The patch was made against the latest SVN from 73005.

        Thanks!

        Cheers,
        Chris Mattmann

        Andrzej Bialecki added a comment -

        Committed to trunk. Thank you!


          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Stefan Groschupf
          • Votes:
            2
            Watchers:
            2
