[CONNECTORS-1602] Continuous crawling doesn't recrawl everything - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: None
Fix Version/s: None
Component/s: Web connector
Labels: None

Description

When crawling a website in continuous crawling mode we saw that not all documents are recrawled.

The site is quite extensive. We figured out that, after a document/page is crawled, it is assigned a recrawl timestamp that falls between the recrawl interval and the max recrawl interval.

But if these timestamps come due while the first crawl is still running, Manifold starts recrawling those documents and seems to ignore the rest of the website. Also, some documents get recrawled 5 times while others don't get recrawled at all, apparently due to the same issue.
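
To make the observation concrete, here is a minimal sketch of the situation described above. It is not ManifoldCF code: the function names, the uniform choice of timestamp, and the sample numbers are all assumptions made for illustration. It only shows that, when the first pass over a large site takes longer than the recrawl interval, early documents become due again before that pass has finished.

```python
import random


def next_recrawl_time(fetched_at, recrawl_interval, max_recrawl_interval, rng):
    """Pick a recrawl timestamp between the two intervals.

    Assumption: the timestamp is drawn uniformly; the report only says it
    lies "in between" the recrawl interval and the max recrawl interval.
    """
    if recrawl_interval > max_recrawl_interval:
        raise ValueError("recrawl_interval must not exceed max_recrawl_interval")
    return fetched_at + rng.uniform(recrawl_interval, max_recrawl_interval)


def due_before_first_pass_ends(fetch_times, first_pass_end,
                               recrawl_interval, max_recrawl_interval, seed=0):
    """Return the documents whose recrawl comes due during the first pass.

    fetch_times maps a (hypothetical) document id to the time, in minutes
    since job start, at which it was first fetched.
    """
    rng = random.Random(seed)
    due = []
    for doc, fetched_at in fetch_times.items():
        due_at = next_recrawl_time(
            fetched_at, recrawl_interval, max_recrawl_interval, rng)
        if due_at < first_pass_end:
            due.append(doc)
    return due


# Hypothetical numbers: the first pass takes 100 minutes, while the
# recrawl window is only 10 to 20 minutes after each fetch.
fetch_times = {"page-a": 0, "page-b": 50, "page-c": 95}
early = due_before_first_pass_ends(fetch_times, first_pass_end=100,
                                   recrawl_interval=10,
                                   max_recrawl_interval=20)
print(early)
```

With these numbers, the pages fetched early are due again long before the first pass ends, while the page fetched near the end is not. Whether the scheduler then favors due documents over not-yet-fetched ones is exactly the open question in this report.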

Is it possible to shed a bit more light on how continuous crawling works?

Is it a good system to use for crawling an extensive website?

People

Assignee: Unassigned

Reporter: Donald Van den Driessche

Votes: 0

Watchers: 2

Dates

Created: 25/Apr/19 09:29

Updated: 01/May/19 11:08

Resolved: 01/May/19 11:08