[NUTCH-120] one "bad" link on a page kills parsing

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None
    • Environment: Ubuntu 5.10

      Description

      Since the try block in OutlinkExtractor.getOutlinks (src/java/org/apache/nutch/parse/OutlinkExtractor.java) wraps the whole

      while (matcher.contains(input, pattern)) {
      ...
      }

      loop, a single URL that throws an exception stops any further links on the page from being extracted.
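
      To make the described structure concrete, here is a minimal sketch of that loop shape (a simplified, hypothetical rendering, not the verbatim Nutch 0.7 source; the Jakarta ORO matcher types and an Outlink(String, String) constructor that throws MalformedURLException are assumed):

      import java.util.ArrayList;
      import java.util.List;
      import org.apache.nutch.parse.Outlink;
      import org.apache.oro.text.regex.*;

      // Sketch only: the single try wraps the whole matching loop, so the first
      // Outlink that throws aborts extraction of every remaining link on the page.
      public class OutlinkLoopSketch {
        public static List getOutlinks(String plainText, String anchor) {
          List outlinks = new ArrayList();
          try {
            Perl5Matcher matcher = new Perl5Matcher();
            // placeholder pattern; the real extractor uses a much longer URL regex
            Pattern pattern = new Perl5Compiler().compile("https?://[^\\s\"'<>]+");
            PatternMatcherInput input = new PatternMatcherInput(plainText);
            while (matcher.contains(input, pattern)) {
              MatchResult result = matcher.getMatch();
              String url = result.group(0);
              outlinks.add(new Outlink(url, anchor));  // may throw MalformedURLException
            }
          } catch (Exception ex) {
            // one bad URL lands here and ends extraction for the whole page
          }
          return outlinks;
        }
      }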

        Activity

        Earl Cahill added a comment -

        This patch wraps each URL in its own try, so one exception-throwing URL on a page shouldn't be fatal.

        Index: src/java/org/apache/nutch/parse/OutlinkExtractor.java
        ===================================================================
        --- src/java/org/apache/nutch/parse/OutlinkExtractor.java (revision 326762)
        +++ src/java/org/apache/nutch/parse/OutlinkExtractor.java (working copy)
        @@ -97,7 +97,11 @@
               while (matcher.contains(input, pattern)) {
                 result = matcher.getMatch();
                 url = result.group(0);
        -        outlinks.add(new Outlink(url, anchor));
        +        try {
        +          outlinks.add(new Outlink(url, anchor));
        +        } catch (Exception ex) {
        +          LOG.throwing(OutlinkExtractor.class.getName(), "getOutlinks", ex);
        +        }
               }
             } catch (Exception ex) {
               // if it is a malformed URL we just throw it away and continue with

        Paul Baclace added a comment -

        Indeed, there is a comment indicating that the code keeps trying, but luckily it does not, and it might be unwise to keep trying after any subclass of Exception occurs. If the catch were more specific, then perhaps continuing would be feasible. If an NPE occurred, continuing could be a recipe for an infinite loop.

        I just noticed this same code passage because, under some conditions, OutlinkExtractor.getOutlinks(text) takes 10 hours to regex-scan one file when it is given a non-plain-text file.

        Recommend: not a bug
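
        To illustrate the more specific catch suggested above, a per-URL catch might look like the following sketch (hypothetical illustration only, not the fix that was committed): only the expected MalformedURLException is swallowed for each link, so an unexpected RuntimeException such as an NPE still propagates out of the loop instead of being silently swallowed on every iteration.

        // Hypothetical narrower per-URL catch (illustration, not the committed fix):
        // swallow only the expected MalformedURLException, and let anything
        // unexpected (e.g. a NullPointerException) escape the loop.
        while (matcher.contains(input, pattern)) {
          MatchResult result = matcher.getMatch();
          String url = result.group(0);
          try {
            outlinks.add(new Outlink(url, anchor));
          } catch (java.net.MalformedURLException ex) {
            LOG.throwing(OutlinkExtractor.class.getName(), "getOutlinks", ex);
          }
        }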

        Earl Cahill added a comment -

        I can't really explain what was happening, but for a time many valid links would throw an exception; then it just stopped. I don't think we really know what is going on in the code, like what actually causes an exception to get thrown. I don't see the possibility of an infinite loop.

        I for one still don't trust that links that throw an exception are really problematic, and I think that one such link shouldn't stop parsing. I am guessing that failed links aren't recorded or generally reviewed, so I see this as a place where parsing and crawling could fail and it would be pretty hard to track down. It just seems a little too unforgiving.

        Andrzej Bialecki added a comment -

        This has been fixed as a part of another commit.


          People

          • Assignee: Unassigned
          • Reporter: Earl Cahill
          • Votes: 1
          • Watchers: 0
