Okay, I've addressed the comments put forth by Andrzej Bialecki regarding the parse-rss plugin that I submitted, and the latest version of it is attached. Included in the zip file is basically the directory for the plugin, which is self-contained. I haven't attached the svn diff/patch file with this, but here are the following things that would need to be done to include this in the source tree:
1. enable/add the plugin in nutch-default.xml (the name of the plugin is parse-rss)
2. add the plugin parse-rss in src/plugin/build.xml in the "deploy", "test" and "clean" targets
3. edit conf/mime.types to add the following two lines in the file:
4. That's it.
I can attach an SVN diff of this from the latest nutch snapshot later on tonight, however just to note, this attachment includes the following updates:
> * package names follow the old naming, the new naming is under
Fixed, tested, and included in the zip file
> * in RSSParser.java, you retrieve contentLength, but the code never
> uses it.
Took this part out: included in the zip file.
> * lines 149-160 seem a bit bogus to me. As I understand the RSS spec,
> the item's permalink should be preferred if present, but it's not an
> error if it's absent (as signified by a null value, which currently
> causes MalformedURLException to be thrown) - in such case the getLink
> should be used instead. The message in line 157 is wrong, too, because
> it prints the url of the channel, and not the current item. When it's
> fixed, it would be also good to demonstrate such fallback in the test
Fixed this part: now, I first check for the perm link, if it's null, I try the regular "link" field, and if there's nothing there, then I make it choke (i.e. throw MalformedURLException). I hope it's a bit more robust
> * I'm not sure what is the purpose of copying through the metadata -
> the code doesn't modify the copy, so you could as well use the
> original, right?
> * probably just a matter of programming style, but I'm always somewhat
> vexed by frequent String concatenations, especially in a "for" loop -
> like in the code that creates the title and the body. StringBuffer-s
> would be a good fit here...
it now uses StringBuffer.append instead of concats
> * IMHO it's better to put the various intermediate diagnostic output
> from the plugin under LOG.fine(), to reduce the amount of information
> to be logged. The final result of processing the content could be put
> LOG.info() or LOG.warn(), depending on the final result. (I personally
> favor no output whatsoever if everything went ok).
Yup, I agree. I've put all not necessary (i.e. non exception) logging to go to LOG.fine, and included in the zip file
> And lastly, a minor thing, but still... the formatting style and
> indentation in most files doesn't adhere to the Nutch coding style
> when it comes to whitespace rules - please see e.g. WebDBReader.java
> as a reference. This especially concerns the whitespace around curly
> braces and assignments, and the use of literal Tab instead of 4
> spaces. This is easy to fix with an IDE, but it helps a lot when
> someone else is reading the code...