NUTCH-1502 Test for CrawlDatum state transitions

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7, 2.2
    • Fix Version/s: 1.9
    • Component/s: crawldb
    • Labels: None
    • Patch Info: Patch Available

    Description

      An exhaustive test of the matrix of CrawlDatum state transitions (CrawlStatus in 2.x) would help detect errors, especially in continuous crawls, where the number of possible transitions is quite large. Additional factors that affect state transitions (retry counters, static and dynamic intervals) are also tested.
      The tests will help to address NUTCH-578 and NUTCH-1245. See the latter for a first sketchy patch.

      Attachments

        1. NUTCH-1502-trunk-v1.patch
          43 kB
          Sebastian Nagel

          Activity

            snagel Sebastian Nagel added a comment - - edited

            Patch which adds the following test units:

            • test matrix of state transitions with
              • CrawlDbReducer and InjectReducer
              • Default and AdaptiveFetchSchedule
            • fetch_gone -> db_gone (NUTCH-1245)
            • not modified time (cf. NUTCH-933)
            • fetch_retry -> db_gone after max retries (NUTCH-578)
            • immediate refetch by sync_delta of AdaptiveFetchSchedule (NUTCH-1564)
            • signature reset / erroneous db_notmodified (NUTCH-1422)

            The latter four points are open issues; the corresponding tests are in a separate TODO test class or marked as such. The tests should make it easier to find solutions for these issues because they are now reproducible. That's the main improvement: the tests log a lot of information, which makes it possible to understand what's going wrong. Since these problems appear only after a long time, they are hard to investigate in real crawls (one would need to check dozens of segments).
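The idea of walking a transition matrix and logging every step can be illustrated with a minimal, self-contained sketch. It does not use Nutch classes: the enums, the `update` rule, and the `maxRetries` parameter are simplified assumptions made for illustration, not the behaviour of CrawlDbReducer or of the attached patch. The toy rule ignores the previous state, which a real update step would not.

```java
import java.util.ArrayList;
import java.util.List;

public class StateTransitionMatrixSketch {

  // Simplified stand-ins for the status names mentioned above;
  // these are NOT Nutch's CrawlDatum constants.
  enum DbStatus { DB_UNFETCHED, DB_FETCHED, DB_NOTMODIFIED, DB_GONE }
  enum FetchStatus { FETCH_SUCCESS, FETCH_NOTMODIFIED, FETCH_RETRY, FETCH_GONE }

  // Hypothetical update rule (an assumption, not Nutch's implementation).
  // 'retries' counts failed attempts including the current one.
  static DbStatus update(DbStatus old, FetchStatus fetch, int retries, int maxRetries) {
    switch (fetch) {
      case FETCH_SUCCESS:
        return DbStatus.DB_FETCHED;
      case FETCH_NOTMODIFIED:
        return DbStatus.DB_NOTMODIFIED;
      case FETCH_GONE:
        return DbStatus.DB_GONE;
      case FETCH_RETRY:
        return retries >= maxRetries ? DbStatus.DB_GONE : DbStatus.DB_UNFETCHED;
      default:
        throw new IllegalArgumentException("unknown fetch status: " + fetch);
    }
  }

  // Walks every (db status, fetch status, retry count) combination, logs the
  // resulting transition, and collects those that violate the expectation
  // "db_gone is reached only via fetch_gone or exhausted retries".
  static List<String> checkMatrix(int maxRetries) {
    List<String> violations = new ArrayList<>();
    for (DbStatus from : DbStatus.values()) {
      for (FetchStatus fetch : FetchStatus.values()) {
        for (int retries = 0; retries <= maxRetries + 1; retries++) {
          DbStatus to = update(from, fetch, retries, maxRetries);
          String step = from + " + " + fetch + " (retries=" + retries + ") -> " + to;
          System.out.println(step);
          boolean goneExpected = fetch == FetchStatus.FETCH_GONE
              || (fetch == FetchStatus.FETCH_RETRY && retries >= maxRetries);
          if (goneExpected != (to == DbStatus.DB_GONE)) {
            violations.add(step);
          }
        }
      }
    }
    return violations;
  }

  public static void main(String[] args) {
    List<String> violations = checkMatrix(3);
    System.out.println("violations: " + violations.size());
  }
}
```

Logging each step, as the loop does, is what makes a failing transition easy to locate without inspecting segments from a long-running crawl.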

            jnioche Julien Nioche added a comment -

            Great stuff! I'd rename the *TODO class to org.apache.nutch.crawl.TODOTestCrawlDbStates so that it is not included in the test suite. It currently fails because the corresponding issues are not fixed.
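Why the rename keeps the class out of the suite can be sketched as follows, assuming the build selects test sources whose file name starts with `Test` (an assumption about the build configuration, not something stated in this issue). The class and method names below are hypothetical; the file names are taken from the commit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TestNameFilterSketch {

  // Assumed selection rule: only files named Test*.java are run as tests.
  static boolean isSelected(String fileName) {
    return fileName.startsWith("Test") && fileName.endsWith(".java");
  }

  // Returns the subset of file names the assumed rule would pick up.
  static List<String> select(List<String> fileNames) {
    return fileNames.stream()
        .filter(TestNameFilterSketch::isSelected)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList(
        "ContinuousCrawlTestUtil.java",
        "CrawlDbUpdateUtil.java",
        "TODOTestCrawlDbStates.java",
        "TestCrawlDbStates.java");
    System.out.println(select(files));
  }
}
```

Under this assumption the `TODO` prefix is enough to skip the failing tests, while the class still compiles alongside the rest of the test sources.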

            jnioche Julien Nioche added a comment -

            Trunk : committed revision 1610628.

            Will port the tests from TODOTestCrawlDbStates.java as part of the resolution of the related issues.

            Let's open a separate issue for porting this to 2.x (if we want to do that)

            Thanks Seb, this is a great addition.

            hudson Hudson added a comment -

            SUCCESS: Integrated in Nutch-trunk #2704 (See https://builds.apache.org/job/Nutch-trunk/2704/)
            NUTCH-1502 Test for CrawlDatum state transitions (snagel) (jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1610628)

            • /nutch/trunk/CHANGES.txt
            • /nutch/trunk/src/test/org/apache/nutch/crawl/ContinuousCrawlTestUtil.java
            • /nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java
            • /nutch/trunk/src/test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
            • /nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbStates.java

            lewismc Lewis John McGibbney added a comment -

            Bulk progression of all “Resolved” issues to “Closed”. Performed by lewismc on 2024-03-13.

            People

              Assignee: Unassigned
              Reporter: snagel Sebastian Nagel