Nutch / NUTCH-1329

Parser does not extract outlinks to external web sites

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.4
    • Fix Version/s: 2.3, 1.8
    • Component/s: parser
    • Labels:

      Description

      Found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: an outlink such as www.example2.com found on a page of www.example1.com is stored as www.example1.com/www.example2.com.
      I corrected this by testing whether the outlink (www.example2.com) is a valid URL on its own; if it is not, it is resolved against its base URL.
      So I replaced these lines:
      URL url = URLUtil.resolveURL(base, target);
      outlinks.add(new Outlink(url.toString(),
          linkText.toString().trim()));

      with:
      String host_temp = null;
      try {
        host_temp = URLUtil.getDomainName(new URL(target));
      } catch (Exception eiuy) {
        host_temp = null;
      }

      URL url = null;
      if (host_temp == null) // it is an internal outlink
        url = URLUtil.resolveURL(base, target);
      else // it is an external link
        url = new URL(target);
      outlinks.add(new Outlink(url.toString(),
          linkText.toString().trim()));
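
      For anyone trying to reproduce this, the following is a minimal sketch using only the JDK's java.net.URL, not Nutch itself. It assumes that URLUtil.resolveURL resolves this kind of target the same way java.net.URL does. The class name OutlinkResolutionSketch and its helper methods are hypothetical and exist only for illustration. It shows that a target written without a scheme (www.example2.com) is treated as a relative path under the base URL, while the same target with a scheme resolves to the external site. It also shows that new URL(target) rejects the scheme-less form, which suggests the patch above would still take the resolveURL branch for such a target.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class, plain JDK only; not part of Nutch.
class OutlinkResolutionSketch {

    // Resolve a link target against the URL of the page it was found on.
    static String resolve(String base, String target) {
        try {
            return new URL(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // True if the target parses as a URL on its own, i.e. it carries a scheme.
    static boolean isAbsolute(String target) {
        try {
            new URL(target);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example1.com/";
        // Scheme-less target: resolved as a path relative to the base.
        System.out.println(resolve(base, "www.example2.com"));
        // Target with a scheme: resolved to the external site.
        System.out.println(resolve(base, "http://www.example2.com/"));
        // Only the second form parses as a standalone URL.
        System.out.println(isAbsolute("www.example2.com"));
        System.out.println(isAbsolute("http://www.example2.com/"));
    }
}
```

      If the pages in question contain links of the first form (href="www.example2.com"), the reported result matches standard relative-URL resolution, which may be why the default configuration did not reproduce a parser fault.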

        Activity

        Tejas Patil added a comment -

        I am not able to reproduce this bug with the default config. Are there any specific configs that you were using?

        Tejas Patil added a comment -

        Should we close this one?
        I tried to reproduce this but could not. If some special configs were used, we don't have them (I asked for them 4 months back; see the comment above).

        Lewis John McGibbney added a comment -

        +1 Close. We can reopen if it is reported again.

        Tejas Patil added a comment -

        Closing for now by marking it "cannot reproduce".


          People

          • Assignee: Unassigned
          • Reporter: behnam nikbakht
          • Votes: 0
          • Watchers: 2
