Description
I've encountered several cases where outlinks are not extracted correctly. Most are caused by the use of URI.resolve().
1. For a base URI of new URI("http://www.domain.com"), <a href="test.html">test.html</a> will be resolved to http://www.domain.comtest.html instead of http://www.domain.com/test.html
2. For a base URI of new URI("http://www.domain.com/index.php"), <a href="?test=true">test with param</a> will be resolved to http://www.domain.com/?test=true instead of http://www.domain.com/index.php?test=true
3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve() will throw an exception, whereas a browser can resolve the URI. (Remark: I didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
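A possible workaround for the three cases above, as a minimal sketch only. The class name LinkResolver and its resolve() method are hypothetical and not part of Droids; the sketch assumes an absolute base URI and that "\n" in case 3 stands for an actual line-break character. It uses only java.net.URI and returns null when the href still cannot be parsed.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Droids: wraps URI.resolve() with
// pre-processing for the three cases listed above.
final class LinkResolver {

    /** Resolves href against base; returns null if href cannot be parsed. */
    static URI resolve(URI base, String href) {
        // Case 3: drop tabs and line breaks, then trim surrounding whitespace,
        // which is roughly what browsers do before parsing a URL.
        String cleaned = href.replaceAll("[\\t\\r\\n]", "").trim();
        try {
            // Case 1: a hierarchical base with an empty path gets "/" as its path.
            if (!base.isOpaque() && base.getRawPath().isEmpty()) {
                String b = base.toString();
                int cut = endOfPath(b);
                base = new URI(b.substring(0, cut) + "/" + b.substring(cut));
            }
            // Case 2: a query-only reference keeps the base path and
            // replaces only the base query (and fragment).
            if (cleaned.startsWith("?")) {
                String b = base.toString();
                return new URI(b.substring(0, endOfPath(b)) + cleaned);
            }
            return base.resolve(new URI(cleaned));
        } catch (URISyntaxException e) {
            return null; // still not parseable; the caller decides what to do
        }
    }

    // Index of the first '?' or '#', or the string length if neither occurs.
    private static int endOfPath(String uri) {
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            if (c == '?' || c == '#') {
                return i;
            }
        }
        return uri.length();
    }
}
```

With this sketch, LinkResolver.resolve(URI.create("http://www.domain.com"), "test.html") yields http://www.domain.com/test.html, and "?test=true" against http://www.domain.com/index.php yields http://www.domain.com/index.php?test=true. Other non-standard hrefs (spaces, backslashes, unescaped characters) are not handled here.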
I suspect there are many different scenarios, many of them probably caused by non-standard usage (but a crawler has to handle non-standard usage in order to function). Obviously, we cannot cater for every case, so I suggest treating a resolve failure as a bug if a link works in a Mozilla browser but not in the Droids LinkExtractor.
This issue is related to the LinkExtractor created in DROIDS-8.
Attachments
Issue Links
- is related to DROIDS-11 Extract the OutgoingLinks task from Parser interface (Resolved)