[NUTCH-2377] Nutch can't parse relative links - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Not A Problem
Affects Version/s: 2.3
Fix Version/s: 2.3.1
Component/s: parser
Labels:
None
Environment:

CentOS 7, HBase 0.98

Flags:

Important

Description

Testing with the following site: https://www.ouedkniss.com, Nutch only parses links that contain the base URL.
I tried Tika as the parser, set db.max.outlinks.per.page to -1, and tried practically every suggestion from comments about detecting all the links. I suspected urlfilter or regex-normalizer, so I disabled them, but got the same results.
Each time I rebuild Nutch and test the parser, it gives the same URL count, around 378.
Can somebody help fix this?
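
For readers unfamiliar with the term, here is a minimal sketch of what resolving relative links against a base URL looks like, using only the JDK's java.net.URI. It is not Nutch's parser code, and the class name, method name, and link paths are hypothetical, chosen only for illustration. Comparing its output with the outlinks a parser reports for the same page is one way to check whether relative links are being dropped.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: resolves href values against a page's base URL.
public class RelativeLinkDemo {

    // Returns the absolute form of each href, resolved against the given base URL.
    static List<String> resolveAll(String base, List<String> hrefs) {
        URI baseUri = URI.create(base);
        List<String> resolved = new ArrayList<>();
        for (String href : hrefs) {
            // URI.resolve leaves absolute hrefs unchanged and resolves relative ones.
            resolved.add(baseUri.resolve(href).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        // The paths below are made up; only the host comes from the report.
        List<String> hrefs = List.of(
                "annonces/page2",           // relative to the base path
                "/help",                    // relative to the host root
                "https://example.org/x");   // already absolute
        for (String url : resolveAll("https://www.ouedkniss.com/", hrefs)) {
            System.out.println(url);
        }
    }
}
```

The base URL is written with a trailing slash on purpose: java.net.URI can produce a malformed result when a relative path is resolved against a base whose path is empty.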

People

Assignee: Unassigned

Reporter: hakim

Votes: 0

Watchers: 3

Dates

Created: 29/Apr/17 23:12

Updated: 13/Mar/24 14:50

Resolved: 03/May/17 20:18