  1. Nutch
  2. NUTCH-1329

Parser does not extract outlinks to external web sites


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.4
    • Fix Version/s: 2.3, 1.8
    • Component/s: parser

    Description

      I found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: an outlink such as www.example2.com found on www.example1.com is inserted as www.example1.com/www.example2.com.
      I corrected this by testing whether the outlink (www.example2.com) is itself a valid URL; if it is not, it is inserted resolved against its base URL.
      So I replaced these lines:
      URL url = URLUtil.resolveURL(base, target);
      outlinks.add(new Outlink(url.toString(),
          linkText.toString().trim()));

      with:
      String host_temp = null;
      try {
        // new URL(target) throws if target is not an absolute URL
        host_temp = URLUtil.getDomainName(new URL(target));
      } catch (Exception eiuy) {
        host_temp = null;
      }

      URL url = null;
      if (host_temp == null) {
        // it is an internal outlink
        url = URLUtil.resolveURL(base, target);
      } else {
        // it is an external link
        url = new URL(target);
      }
      outlinks.add(new Outlink(url.toString(),
          linkText.toString().trim()));
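
      To make the two cases in the description concrete, here is a minimal standalone sketch that uses only java.net.URL. It is not the Nutch code: the class OutlinkResolutionSketch and the helpers resolve and isAbsoluteUrl are hypothetical names introduced for illustration, and new URL(base, target) is used as an assumed stand-in for URLUtil.resolveURL(base, target). It shows how a target with and without a scheme resolves against a base page URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class OutlinkResolutionSketch {

    // Hypothetical helper (not part of Nutch): resolve a link target against
    // the URL of the page it was found on, using plain java.net.URL.
    static String resolve(String base, String target) {
        try {
            return new URL(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("cannot resolve " + target, e);
        }
    }

    // Hypothetical helper mirroring the try/catch in the proposed change:
    // new URL(target) only succeeds when target carries its own scheme.
    static boolean isAbsoluteUrl(String target) {
        try {
            new URL(target);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example1.com/";

        // Target without a scheme: treated as a path relative to the base.
        System.out.println(resolve(base, "www.example2.com"));
        // prints http://www.example1.com/www.example2.com

        // Target with a scheme and host: replaces the base host entirely.
        System.out.println(resolve(base, "http://www.example2.com/"));
        // prints http://www.example2.com/

        System.out.println(isAbsoluteUrl("www.example2.com"));         // prints false
        System.out.println(isAbsoluteUrl("http://www.example2.com/")); // prints true
    }
}
```

      Under these assumptions, whether the href in the page carries a scheme (http://...) decides which branch a link falls into, so the exact markup of the page that triggered the report matters when trying to reproduce it.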


          People

            Assignee: Unassigned
            Reporter: behnam nikbakht
            Votes: 0
            Watchers: 2
