  1. Nutch
  2. NUTCH-89

parse-rss null pointer exception

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7, 0.8
    • Fix Version/s: 0.7, 0.8
    • Component/s: fetcher
    • Labels: None

    Description

      The RSS parser throws an exception. The cause is a syntax error in the fetched page: when it hits such a page, the parser tries to add an outlink with null as its anchor. The anchor of an outlink must not be null.

      java.lang.NullPointerException
      at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
      at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
      at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
      at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
      at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
      at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
      at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
      at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
      at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
      Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
      at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
      at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
      at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)
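
The failure mode in the trace can be reproduced in isolation. The sketch below is not Nutch code: `writeString` is a hypothetical stand-in for a serializer that, like the `UTF8.writeString` frame at the top of the trace, is assumed to dereference its string argument without a null check.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class NullAnchorDemo {
    // Hypothetical stand-in (not the Nutch implementation): assumes s is non-null.
    public static void writeString(DataOutputStream out, String s) throws IOException {
        out.writeInt(s.length()); // throws NullPointerException when s is null
        out.writeChars(s);
    }

    // Returns true if serializing the given anchor fails with a NullPointerException.
    public static boolean failsWithNpe(String anchor) {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            writeString(out, anchor);
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("null anchor fails:  " + failsWithNpe(null));
        System.out.println("empty anchor fails: " + failsWithNpe(""));
    }
}
```

An empty string serializes without error, which is why substituting "" for a missing description is enough to avoid the exception.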

      I suggest the following patch:

      Index: src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java
      ===================================================================
      --- src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java (revision 279397)
      +++ src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java (working copy)
      @@ -157,11 +157,13 @@
       if (r.getLink() != null) {
         try {
           // get the outlink
      -    theOutlinks.add(new Outlink(r.getLink(), r
      -        .getDescription()));
      +    if (r.getDescription()!= null ) {
      +      theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));
      +    } else {
      +      theOutlinks.add(new Outlink(r.getLink(), ""));
      +    }
         } catch (MalformedURLException e) {
      -    LOG
      -        .info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
      +    LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
                   + r.getLink()
                   + ": Attempting to continue processing outlinks");
           e.printStackTrace();
      @@ -185,12 +187,13 @@
       
       if (whichLink != null) {
         try {
      -    theOutlinks.add(new Outlink(whichLink, theRSSItem
      -        .getDescription()));
      -
      +    if (theRSSItem.getDescription()!=null) {
      +      theOutlinks.add(new Outlink(whichLink, theRSSItem.getDescription()));
      +    } else {
      +      theOutlinks.add(new Outlink(whichLink, ""));
      +    }
         } catch (MalformedURLException e) {
      -    LOG
      -        .info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
      +    LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
                   + whichLink
                   + ": Attempting to continue processing outlinks");
           e.printStackTrace();
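
Both hunks apply the same null check, so the fallback could also be factored into a single helper. This is a minimal sketch, assuming a hypothetical `anchorOrEmpty` method that is not part of the attached patch or of Nutch itself:

```java
public class AnchorText {
    // Hypothetical helper: substitutes an empty anchor when the
    // RSS item carries no description.
    public static String anchorOrEmpty(String description) {
        return description != null ? description : "";
    }

    public static void main(String[] args) {
        System.out.println("[" + anchorOrEmpty(null) + "]");
        System.out.println("[" + anchorOrEmpty("Latest news") + "]");
    }
}
```

With such a helper, each call site would reduce to `new Outlink(r.getLink(), anchorOrEmpty(r.getDescription()))`. Whether that reads better than the inline if/else in the patch is a style choice.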

      Attachments

        1. parse-rss.20050910.patch
          3 kB
          Michael Nebel

          People

            Assignee: Unassigned
            Reporter: Michael Nebel (mnebel)
            Votes: 1
            Watchers: 1