Details
Description
An exhaustive test of the matrix of CrawlDatum state transitions (CrawlStatus in 2.x) would be useful for detecting errors, especially in continuous crawls, where the number of possible transitions is quite large. Additional factors that affect state transitions (retry counters, static and dynamic intervals) are also tested.
The tests will help to address NUTCH-578 and NUTCH-1245. See the latter for a first, sketchy patch.
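As a rough illustration of what "exhaustive" means here, the following sketch enumerates every combination of start state, fetch outcome and retry count, and checks a couple of invariants on each result. The state names, the outcomes, the `next` transition function and the `MAX_RETRIES` limit are all simplified assumptions made up for this example; they are not the Nutch CrawlDatum API or the CrawlDb update logic.

```java
// Hypothetical sketch: none of these types are Nutch classes.
class TransitionMatrixSketch {

    // Simplified page states (the real CrawlDatum has more).
    enum State { UNFETCHED, FETCHED, NOTMODIFIED, GONE }

    // Simplified fetch outcomes.
    enum Outcome { FETCH_SUCCESS, FETCH_NOTMODIFIED, FETCH_GONE, FETCH_RETRY }

    // Assumed retry limit, for illustration only.
    static final int MAX_RETRIES = 3;

    // Toy transition function standing in for the code under test.
    static State next(State from, Outcome outcome, int retries) {
        switch (outcome) {
            case FETCH_SUCCESS:
                return State.FETCHED;
            case FETCH_NOTMODIFIED:
                // "not modified" is only meaningful for a page fetched before
                return from == State.UNFETCHED ? State.FETCHED : State.NOTMODIFIED;
            case FETCH_GONE:
                return State.GONE;
            case FETCH_RETRY:
                // give up once the retry counter reaches the limit
                return retries >= MAX_RETRIES ? State.GONE : from;
            default:
                throw new IllegalArgumentException("unknown outcome: " + outcome);
        }
    }

    // Walks the full matrix and returns the number of combinations checked.
    static int checkAll() {
        int checked = 0;
        for (State from : State.values()) {
            for (Outcome outcome : Outcome.values()) {
                for (int retries = 0; retries <= MAX_RETRIES; retries++) {
                    State to = next(from, outcome, retries);
                    String label = from + " + " + outcome
                            + " (retries=" + retries + ") -> " + to;
                    // Invariant 1: a never-fetched page cannot be "not modified".
                    if (from == State.UNFETCHED && to == State.NOTMODIFIED) {
                        throw new IllegalStateException("invalid transition: " + label);
                    }
                    // Invariant 2: a retry below the limit keeps the old state.
                    if (outcome == Outcome.FETCH_RETRY && retries < MAX_RETRIES && to != from) {
                        throw new IllegalStateException("retry changed state: " + label);
                    }
                    checked++;
                }
            }
        }
        return checked;
    }

    public static void main(String[] args) {
        System.out.println("checked " + checkAll() + " transitions");
    }
}
```

Looping over `values()` rather than hand-picking cases means that a newly added state or outcome is covered automatically, which is the point of testing the whole matrix.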
Attachments
Issue Links
- relates to
  - NUTCH-1564 AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified (Open)
  - NUTCH-1422 bypass signature comparison when a document is redirected (Closed)
Patch which adds the following test units:
- NUTCH-1245
- NUTCH-933
- NUTCH-578
- NUTCH-1422

The latter four points are open issues; the corresponding tests are in a separate TODO test class or marked as such. The tests should make it easier to find solutions for these issues because the problems are now reproducible. That is the main improvement: the tests log a lot of information, which makes it possible to understand what is going wrong. Since these problems only appear after a long time, they are hard to investigate in real crawls (dozens of segments would need to be checked).
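To illustrate why a simulated crawl with verbose logging helps, here is a minimal sketch that replays many daily crawl cycles against a simulated clock and records every state change. The class, the `simulate` method, the string status values and the fixed fetch interval are assumptions invented for this example; the actual tests in the patch work on Nutch's own classes and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a simulated continuous crawl; not Nutch code.
class ContinuousCrawlSketch {

    static final long DAY = 24L * 60 * 60 * 1000; // one day in milliseconds

    // Runs `days` daily cycles for a single page and returns one log line
    // per fetch. The page is refetched once its fetch interval has elapsed.
    static List<String> simulate(int days, long fetchIntervalDays) {
        List<String> log = new ArrayList<>();
        long nextFetch = 0;           // simulated time of the next due fetch
        String status = "unfetched";  // simplified stand-in for the page status
        for (int day = 0; day < days; day++) {
            long now = day * DAY;     // simulated clock, no real waiting
            if (now >= nextFetch) {
                String old = status;
                status = "fetched";
                nextFetch = now + fetchIntervalDays * DAY;
                log.add("day " + day + ": " + old + " -> " + status
                        + ", next fetch on day " + (nextFetch / DAY));
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // Months of crawling are replayed instantly and every step is visible.
        for (String line : simulate(30, 7)) {
            System.out.println(line);
        }
    }
}
```

Because the clock is simulated, a problem that would take weeks to show up in a real crawl appears within a single test run, and the log shows the exact cycle at which a transition went wrong.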