Description
I found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: outlinks such as www.example2.com found on www.example1.com are inserted as www.example1.com/www.example2.com.
I corrected this by testing whether the outlink (www.example2.com) is a valid URL on its own; if it is not, it is inserted resolved against its base URL.
So I replaced these lines:
URL url = URLUtil.resolveURL(base, target);
outlinks.add(new Outlink(url.toString(),
linkText.toString().trim()));
with:
String host_temp=null;
try
catch(Exception eiuy)
{ host_temp=null; }
URL url=null;
if(host_temp==null) // it is an internal outlink
url = URLUtil.resolveURL(base, target);
else // it is an external link
url=new URL(target);
outlinks.add(new Outlink(url.toString(),
linkText.toString().trim()));
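
The idea behind the change can be sketched as a small standalone helper. This is only an illustration under assumptions, not the actual patch: the class name OutlinkResolver and the method resolveOutlink are hypothetical, and plain java.net.URL resolution stands in for Nutch's URLUtil.resolveURL so that the example compiles without Nutch on the classpath.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch: illustrates the
// "absolute URL first, otherwise resolve against base" check.
class OutlinkResolver {

    // Returns the target itself when it parses as an absolute URL with a host,
    // otherwise resolves it against the base URL.
    static URL resolveOutlink(URL base, String target) throws MalformedURLException {
        String host = null;
        try {
            // Succeeds only when target carries its own protocol.
            host = new URL(target).getHost();
        } catch (MalformedURLException e) {
            host = null; // not a standalone URL, treat it as relative
        }
        if (host == null || host.isEmpty()) {
            // Internal outlink. Assumption: java.net.URL stands in for URLUtil.resolveURL.
            return new URL(base, target);
        }
        // External outlink: use the target as given.
        return new URL(target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.example1.com/index.html");
        System.out.println(resolveOutlink(base, "http://www.example2.com/"));
        System.out.println(resolveOutlink(base, "page.html"));
        System.out.println(resolveOutlink(base, "www.example2.com"));
    }
}
```

Note that java.net.URL rejects a string with no protocol, so in this sketch a bare target such as www.example2.com is still treated as relative and resolved against the base; only targets that include a scheme (for example http://www.example2.com/) take the external branch.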