Nutch / NUTCH-1329

Parser does not extract outlinks to external web sites

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.4
    • Fix Version/s: 2.3, 1.8
    • Component/s: parser
    • Labels:

      Description

      I found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: outlinks from www.example1.com to an external site such as www.example2.com are inserted as www.example1.com/www.example2.com.
      I corrected this by testing whether the outlink target (www.example2.com) is a valid URL on its own. If it is, it is used as-is; otherwise it is resolved against its base URL.
      To do so, I replaced these lines:
      URL url = URLUtil.resolveURL(base, target);
      outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));

      with:
      String host_temp = null;
      try {
        host_temp = URLUtil.getDomainName(new URL(target));
      } catch (Exception eiuy) {
        host_temp = null;
      }

      URL url = null;
      if (host_temp == null) { // it is an internal outlink
        url = URLUtil.resolveURL(base, target);
      } else { // it is an external link
        url = new URL(target);
      }
      outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
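
      The behavior described above can be checked outside Nutch with plain java.net.URL. This is a minimal sketch and not code from the Nutch source: it assumes that URLUtil.resolveURL(base, target) behaves like new URL(base, target) for these targets, and the class and method names (OutlinkResolutionSketch, resolve, parsesStandalone) are hypothetical. It shows that a target without a scheme is treated as a relative path, while a target with a scheme resolves to the external host.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class for illustration only; not part of Nutch.
@SuppressWarnings("deprecation") // URL constructors are deprecated on recent JDKs
class OutlinkResolutionSketch {

    // Resolves target against base the way java.net.URL does.
    // Assumption: Nutch's URLUtil.resolveURL behaves the same for these inputs.
    static String resolve(String base, String target) {
        try {
            return new URL(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns true if target can be parsed as a URL without a base,
    // which is the condition the proposed patch relies on.
    static boolean parsesStandalone(String target) {
        try {
            new URL(target);
            return true;
        } catch (MalformedURLException e) {
            return false; // e.g. "no protocol" for scheme-less targets
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example1.com/";

        // No scheme: treated as a relative path under the base host.
        System.out.println(resolve(base, "www.example2.com"));

        // With a scheme: resolved as an absolute, external URL.
        System.out.println(resolve(base, "http://www.example2.com/"));

        // Scheme-less targets do not parse standalone.
        System.out.println(parsesStandalone("www.example2.com"));
        System.out.println(parsesStandalone("http://www.example2.com/"));
    }
}
```

      If that assumption holds, a target written as http://www.example2.com/ already resolves to the external site without the patch, and a bare www.example2.com fails the standalone parse and would still take the internal-link branch.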


          People

          • Assignee: Unassigned
          • Reporter: behnam nikbakht
          • Votes: 0
          • Watchers: 2
