NUTCH-2456

Allow indexing of pages/URLs not contained in CrawlDb


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: indexer
    • Labels: None

    Description

      If http.redirect.max is set to a positive value, the Fetcher follows redirects and creates a new CrawlDatum for the redirect target.
      When that redirected URL is fetched and parsed, indexing hits a special case: its dbDatum is null. As a result, at https://github.com/apache/nutch/blob/6199492f5e1e8811022257c88dbf63f1e1c739d0/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L259 the document is not indexed, because the code assumes it only has inlinks (in fact it has everything except the dbDatum).
      I'm not sure what the correct fix is. It seems to me the condition should use AND instead of OR, but I may be missing the original intent. In any case, it is clearly too strict as it stands.
      However, the code following that line assumes all four objects are non-null, so a patch would need to change more than just the condition.
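      To make the discussion concrete, here is a minimal sketch of the guard as I read it, plus one possible relaxation. The class and method names are hypothetical, and the condition is paraphrased from the linked line rather than copied; this is an illustration of the idea, not a tested patch.

        import org.apache.nutch.crawl.CrawlDatum;
        import org.apache.nutch.parse.ParseData;
        import org.apache.nutch.parse.ParseText;

        /** Hypothetical sketch, not the actual Nutch source. */
        public class IndexGuardSketch {

          /** Guard as it behaves today (paraphrased): any missing piece, including a
           *  null dbDatum, makes the record look like "inlinks only" and it is dropped. */
          static boolean skipCurrent(CrawlDatum fetchDatum, CrawlDatum dbDatum,
                                     ParseText parseText, ParseData parseData) {
            return fetchDatum == null || dbDatum == null
                || parseText == null || parseData == null;
          }

          /** One possible relaxation (untested): only require fetch and parse data, so
           *  redirect targets without a CrawlDb entry can still be indexed. The code
           *  after the guard would then need explicit null handling for dbDatum. */
          static boolean skipRelaxed(CrawlDatum fetchDatum, CrawlDatum dbDatum,
                                     ParseText parseText, ParseData parseData) {
            return fetchDatum == null || parseText == null || parseData == null;
          }
        }

      Either way, the code after the guard (and possibly the indexing filters) would need to cope with a missing or synthesized dbDatum.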


            People

              Assignee: Unassigned
              Reporter: Yossi Tamari
              Votes: 0
              Watchers: 6
