Description
The RSS parser throws an exception. The cause is a syntax error in the page: when it hits such a page, the parser tries to add an outlink with null as its anchor. The anchor of an outlink must not be null.
java.lang.NullPointerException
at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)
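The trace shows that the null anchor is only noticed when the fetcher serializes the parse data, some time after the outlink was created. A minimal sketch of that failure mode follows. It assumes the JDK's `DataOutputStream.writeUTF` as a stand-in for Nutch's `UTF8.writeString`, and the class `NullAnchorDemo` and its methods are hypothetical, not part of Nutch:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of Nutch.
class NullAnchorDemo {

    // Serializes an anchor string to bytes.
    // Assumption: DataOutputStream.writeUTF stands in for Nutch's UTF8.writeString.
    static byte[] writeAnchor(String anchor) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(anchor); // throws NullPointerException when anchor is null
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns true if serializing the anchor fails with a NullPointerException.
    static boolean failsOnWrite(String anchor) {
        try {
            writeAnchor(anchor);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("null anchor fails:  " + failsOnWrite(null));
        System.out.println("empty anchor fails: " + failsOnWrite(""));
    }
}
```

In this sketch an empty string serializes without error while null does not, which is why substituting an empty anchor at parse time would avoid the crash at write time.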
I suggest the following patch:
Index: src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java
===================================================================
--- src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java (revision 279397)
+++ src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java (working copy)
@@ -157,11 +157,13 @@
if (r.getLink() != null) {
try {
// get the outlink
- theOutlinks.add(new Outlink(r.getLink(), r
- .getDescription()));
+ if (r.getDescription()!= null ) {
+ theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));
+ }else {
+ theOutlinks.add(new Outlink(r.getLink(), ""));
+ }
} catch (MalformedURLException e) {
- LOG
- .info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
+ LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
+ r.getLink()
+ ": Attempting to continue processing outlinks");
e.printStackTrace();
@@ -185,12 +187,13 @@
if (whichLink != null) {
try {
- theOutlinks.add(new Outlink(whichLink, theRSSItem
- .getDescription()));
-
+ if (theRSSItem.getDescription()!=null) {
+ theOutlinks.add(new Outlink(whichLink, theRSSItem.getDescription()));
+ }else {
+ theOutlinks.add(new Outlink(whichLink, ""));
+ }
} catch (MalformedURLException e) {
- LOG
- .info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
+ LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
+ whichLink
+ ": Attempting to continue processing outlinks");
e.printStackTrace();
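Both hunks apply the same null check, so the substitution could also live in one small helper. This is only a sketch of that alternative. The class `AnchorText` and its method `orEmpty` are hypothetical names chosen for illustration and are not part of the patch or of Nutch:

```java
// Hypothetical helper; the name and placement are illustrative only.
final class AnchorText {
    private AnchorText() {}

    // Returns the description unchanged, or "" when the feed supplied none.
    static String orEmpty(String description) {
        return description != null ? description : "";
    }
}
```

With such a helper, each call site would stay a single statement, for example `theOutlinks.add(new Outlink(whichLink, AnchorText.orEmpty(theRSSItem.getDescription())));`, and the behavior would match the patch above.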