[NUTCH-2973] Single domain names (eg https://localnet) can't be crawled - filtering fails - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.19
Fix Version/s: 1.20
Component/s: fetcher
Labels: None
Environment:

Nutch 1.19, checked on Windows 10 and Ubuntu. Both have the same issue.

I'm trying to crawl a SharePoint intranet using Nutch, where the URLs are similar to:

https://localnet/something.aspx

The issue is that Nutch rejects any URL with a single-element domain name such as localnet above. "localnet.com" is not rejected, nor is "local.localnet". It almost feels as if there's a chunk of code within Nutch, unrelated to the filtering mechanisms, that rejects URLs outright if they don't have a WWW-style format and a WWW-style domain such as .com.

Error message:

Total urls rejected by filters: 1

I've checked and updated all the filter files in the conf directory. Even making them incredibly permissive (effectively "crawl everything") has not helped.


Description

There appears to be a bug within the core of Nutch that prevents URLs with a single-element domain name from being crawled. Example:

https://localnet/something.aspx

The issue is that Nutch rejects any URL with a single-element domain name such as localnet above. "localnet.com" is not rejected, nor is "local.localnet". It almost feels as if there's a chunk of code within Nutch, unrelated to the filtering mechanisms, that rejects URLs outright if they don't have a WWW-style format and a WWW-style domain such as .com.

Error message:

Total urls rejected by filters: 1

I've checked and updated all the filter files in the conf directory. Even making them incredibly permissive (effectively "crawl everything") has not helped. As soon as a dot (.) is added to the domain name, the URL is no longer rejected, e.g. blah.localnet.
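
The symptom described above (a host with no dot is rejected, while any dotted host passes) is what one would see from a validity check that requires at least two labels in the host name. The following Python sketch only illustrates that kind of check. It is not Nutch code, and `has_dotted_host` and `filter_urls` are hypothetical names introduced for this example.

```python
from urllib.parse import urlsplit


def has_dotted_host(url: str) -> bool:
    """Hypothetical check: accept only URLs whose host has at least two labels."""
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    return len(labels) >= 2


def filter_urls(urls):
    """Split URLs into (accepted, rejected) using the hypothetical check above."""
    accepted = [u for u in urls if has_dotted_host(u)]
    rejected = [u for u in urls if not has_dotted_host(u)]
    return accepted, rejected


# The three host shapes mentioned in the report.
seeds = [
    "https://localnet/something.aspx",
    "https://localnet.com/something.aspx",
    "https://local.localnet/something.aspx",
]

accepted, rejected = filter_urls(seeds)
print(f"Rejected by this illustrative check: {len(rejected)}")
for url in rejected:
    print("  ", url)
```

A check of this shape would explain why loosening the regex-based filter files has no effect: the rejection would come from a separate validity test rather than from the configured patterns.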

Issue Links

is superseded by

NUTCH-2985 Disable plugin urlfilter-validator by default

Closed

People

Assignee: Sebastian Nagel

Reporter: David Smith

Votes: 0

Watchers: 4

Dates

Created: 20/Oct/22 02:05

Updated: 13/Mar/24 14:51

Resolved: 06/Mar/23 12:22