Details

    • Type: Sub-task Sub-task
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None

      Description

      The setup and initialization could be simplified by moving some common code to a base class or some helper functions

      1. kafka-502-v9.patch
        293 kB
        John Fung
      2. kafka-502-v8.patch
        293 kB
        John Fung
      3. kafka-502-v7.patch
        303 kB
        John Fung
      4. kafka-502-v6.patch
        326 kB
        John Fung
      5. kafka-502-v5.patch
        325 kB
        John Fung
      6. kafka-502-v4.patch
        325 kB
        John Fung
      7. kafka-502-v3.patch
        246 kB
        John Fung
      8. kafka-502-v2.patch
        242 kB
        John Fung
      9. kafka-502-v1.patch
        196 kB
        John Fung

        Issue Links

          Activity

          John Fung created issue -
          John Fung made changes -
          Field Original Value New Value
          Assignee John Fung [ jfung ]
          Joel Koshy made changes -
          Labels replication-testing
          Fix Version/s 0.8 [ 12317244 ]
          Affects Version/s 0.8 [ 12317244 ]
          Hide
          John Fung added a comment - - edited

          kafka-502-v1.patch

          This patch contains changes for the following JIRA:

          KAFKA-[441-448]
          KAFKA-488
          KAFKA-502
          KAFKA-503
          KAFKA-507
          KAFKA-526

          A. Adding ReplicaBasicTest test cases to cover KAFKA-[441~448]. There are 10 cases with failure and 10 cases without failure. These 10 cases are running in the same ReplicaBasicTest class with varying parameters to get different combinations of arguments such as compression, request-num-acks in Producer, no. of partition when creating topic, broker log segment size, ... etc.

          A.1. Add 10 test cases for ReplicaBasicTest without failure: testcase_0001 ~ testcase_0010

          A.2. Add 10 test cases for ReplicaBasicTest with failure: testcase_0101 ~ testcases_0110

          B. Adding features:

          B.1. Support multi topics. Currently, ProducerPerformance doesn't support multiple topics. This multi topics testing is achieved by creating one thread of ProducerPerformance on each topic and let them run in the background.

          B.2. Support cluster_config.json at the following levels:

          B.2.1. global level : system_test/cluster_config.json

          This global level of cluster_config.json should always exist by default and the config will be saved in the list “SystemTestEnv.clusterEntityConfigDictListInSystemTestLevel”

          B.2.2. testsuite level : system_test/xxxx_testsuite/cluster_config.json

          The following function:
          system_test_utils.setup_remote_hosts_with_testsuite_level_cluster_config

          will look up if there is a testsuite level cluster config file. If it exists, the config will be saved in the list “SystemTestEnv.clusterEntityConfigDictListLastFoundInTestSuite” and this config will override the system test level.

          B.2.3. testcase level : system_test/xxxx_testsuite/testcase_xxxx/cluster_config.json

          The following function:
          system_test_utils.setup_remote_hosts_with_testcase_level_cluster_config

          will look up if there is a testcase level cluster config file. If it exists, the config will be saved in the list “SystemTestEnv.clusterEntityConfigDictListLastFoundInTestCase”

          B.2.4. Assuming there is a cluster_config.json in replication_testsuite/testcase_0001, this config will be effective in testcase_0001. If there is no cluster_config.json in testcase_0002, it is going to restore the config found in replication_testsuite. If there is no cluster config in the testsuite level, it is going to restore the config in the system test level (global).

          B.3. Validate checksum matching across all replicas – this is validated by comparing the checksums of the log segment by a combination of topic-partition-offset as a key such as:

          {u'broker-1': {'test_1-0/00000000000000000000.kafka': '3284084293',
          'test_1-0/00000000000000010271.kafka': '853280790',
          'test_1-0/00000000000000020555.kafka': '3106018734',
          . . .
          u'broker-2': {'test_1-0/00000000000000000000.kafka': '3284084293',
          'test_1-0/00000000000000010271.kafka': '853280790',
          'test_1-0/00000000000000020555.kafka': '3106018734',
          . . .
          u'broker-3': {'test_1-0/00000000000000000000.kafka': '3284084293',
          'test_1-0/00000000000000010271.kafka': '853280790',
          'test_1-0/00000000000000020555.kafka': '3106018734',
          . . .

          B.4. User may specify test cases to run in system_test/testcase_to_run.json (KAFKA-503)

          { "MirrorMakerTest" : [ "testcase_0001" ], "ReplicaBasicTest" : [ "testcase_0001", "testcase_0002", ] }

          B.5. User may specify test cases to skip in system_test/testcase_to_skip.json (KAFKA-503)
          Similar to B.4.

          B.6. System Test has a command line argument '-p' to print test case description only (don't run any test)

          To display the testcase description, run: “python -B system_test_runner.py -p”

          B.7. System Test has a command line argument '-n' for not cleaning remote hosts kafka home directories to save time or preserve the previous version binaries

          C. Refactoring:

          C.1. Move these log messages to DEBUG level “Found the log line” & “Unix timestamp”

          C.2. Move logging config in system_test_runner.py to an external file: system_file/logging.conf

          C.3. At the end of each test case, stop ZK at last to avoid hanging brokers (KAFKA-507)

          C.4. In the beginning of each test case, the zookeeper and broker data directories are removed to clean up any hidden files under them (KAFKA-526)

          C.5. In the test suite class (eg. ReplicaBasicTest) move the logic of setup and initialization to base classes (ReplicationUtils & SetupUtils) (KAFKA-502)

          Show
          John Fung added a comment - - edited kafka-502-v1.patch This patch contains changes for the following JIRA: KAFKA- [441-448] KAFKA-488 KAFKA-502 KAFKA-503 KAFKA-507 KAFKA-526 A. Adding ReplicaBasicTest test cases to cover KAFKA- [441~448] . There are 10 cases with failure and 10 cases without failure. These 10 cases are running in the same ReplicaBasicTest class with varying parameters to get different combinations of arguments such as compression, request-num-acks in Producer, no. of partition when creating topic, broker log segment size, ... etc. A.1. Add 10 test cases for ReplicaBasicTest without failure: testcase_0001 ~ testcase_0010 A.2. Add 10 test cases for ReplicaBasicTest with failure: testcase_0101 ~ testcases_0110 B. Adding features: B.1. Support multi topics. Currently, ProducerPerformance doesn't support multiple topics. This multi topics testing is achieved by creating one thread of ProducerPerformance on each topic and let them run in the background. B.2. Support cluster_config.json at the following levels: B.2.1. global level : system_test/cluster_config.json This global level of cluster_config.json should always exist by default and the config will be saved in the list “SystemTestEnv.clusterEntityConfigDictListInSystemTestLevel” B.2.2. testsuite level : system_test/xxxx_testsuite/cluster_config.json The following function: system_test_utils.setup_remote_hosts_with_testsuite_level_cluster_config will look up if there is a testsuite level cluster config file. If it exists, the config will be saved in the list “SystemTestEnv.clusterEntityConfigDictListLastFoundInTestSuite” and this config will override the system test level. B.2.3. testcase level : system_test/xxxx_testsuite/testcase_xxxx/cluster_config.json The following function: system_test_utils.setup_remote_hosts_with_testcase_level_cluster_config will look up if there is a testcase level cluster config file. If it exists, the config will be saved in the list “SystemTestEnv.clusterEntityConfigDictListLastFoundInTestCase” B.2.4. Assuming there is a cluster_config.json in replication_testsuite/testcase_0001, this config will be effective in testcase_0001. If there is no cluster_config.json in testcase_0002, it is going to restore the config found in replication_testsuite. If there is no cluster config in the testsuite level, it is going to restore the config in the system test level (global). B.3. Validate checksum matching across all replicas – this is validated by comparing the checksums of the log segment by a combination of topic-partition-offset as a key such as: {u'broker-1': {'test_1-0/00000000000000000000.kafka': '3284084293', 'test_1-0/00000000000000010271.kafka': '853280790', 'test_1-0/00000000000000020555.kafka': '3106018734', . . . u'broker-2': {'test_1-0/00000000000000000000.kafka': '3284084293', 'test_1-0/00000000000000010271.kafka': '853280790', 'test_1-0/00000000000000020555.kafka': '3106018734', . . . u'broker-3': {'test_1-0/00000000000000000000.kafka': '3284084293', 'test_1-0/00000000000000010271.kafka': '853280790', 'test_1-0/00000000000000020555.kafka': '3106018734', . . . B.4. User may specify test cases to run in system_test/testcase_to_run.json ( KAFKA-503 ) { "MirrorMakerTest" : [ "testcase_0001" ], "ReplicaBasicTest" : [ "testcase_0001", "testcase_0002", ] } B.5. User may specify test cases to skip in system_test/testcase_to_skip.json ( KAFKA-503 ) Similar to B.4. B.6. System Test has a command line argument '-p' to print test case description only (don't run any test) To display the testcase description, run: “python -B system_test_runner.py -p” B.7. System Test has a command line argument '-n' for not cleaning remote hosts kafka home directories to save time or preserve the previous version binaries C. Refactoring: C.1. Move these log messages to DEBUG level “Found the log line” & “Unix timestamp” C.2. Move logging config in system_test_runner.py to an external file: system_file/logging.conf C.3. At the end of each test case, stop ZK at last to avoid hanging brokers ( KAFKA-507 ) C.4. In the beginning of each test case, the zookeeper and broker data directories are removed to clean up any hidden files under them ( KAFKA-526 ) C.5. In the test suite class (eg. ReplicaBasicTest) move the logic of setup and initialization to base classes (ReplicationUtils & SetupUtils) ( KAFKA-502 )
          John Fung made changes -
          Link This issue contains KAFKA-503 [ KAFKA-503 ]
          John Fung made changes -
          Link This issue contains KAFKA-507 [ KAFKA-507 ]
          John Fung made changes -
          Link This issue contains KAFKA-526 [ KAFKA-526 ]
          John Fung made changes -
          Link This issue contains KAFKA-447 [ KAFKA-447 ]
          John Fung made changes -
          Link This issue contains KAFKA-448 [ KAFKA-448 ]
          John Fung made changes -
          Link This issue contains KAFKA-442 [ KAFKA-442 ]
          John Fung made changes -
          Link This issue contains KAFKA-441 [ KAFKA-441 ]
          John Fung made changes -
          Link This issue contains KAFKA-446 [ KAFKA-446 ]
          John Fung made changes -
          Link This issue contains KAFKA-445 [ KAFKA-445 ]
          John Fung made changes -
          Link This issue contains KAFKA-444 [ KAFKA-444 ]
          John Fung made changes -
          Link This issue contains KAFKA-443 [ KAFKA-443 ]
          John Fung made changes -
          Attachment kafka-502-v1.patch [ 12547058 ]
          John Fung made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hide
          John Fung added a comment - - edited

          Uploaded kafka-502-v2.patch with additional changes on top of v1:

          1. Support multiple Zookeeper cluster
          2. Add a new test suite for Mirror Maker in Replication
          3. If system_test/testcase_to_run.json exists, the test framework will run the test case specified in it. After installation, 1 test case "ReplicaBasicTest.testcase_0001" will run in localhost by default.

          Show
          John Fung added a comment - - edited Uploaded kafka-502-v2.patch with additional changes on top of v1: 1. Support multiple Zookeeper cluster 2. Add a new test suite for Mirror Maker in Replication 3. If system_test/testcase_to_run.json exists, the test framework will run the test case specified in it. After installation, 1 test case "ReplicaBasicTest.testcase_0001" will run in localhost by default.
          John Fung made changes -
          Attachment kafka-502-v2.patch [ 12547244 ]
          John Fung made changes -
          Link This issue contains KAFKA-488 [ KAFKA-488 ]
          Hide
          John Fung added a comment -

          Uploaded kafka-502-v3.patch with additional changes on top of kafka-502-v2.patch:

          Added a shell script "run_sanity.sh" to run the existing test case (first checked in under KAFKA-440):
          system_test/replication_testsuite/testcase_1

          Show
          John Fung added a comment - Uploaded kafka-502-v3.patch with additional changes on top of kafka-502-v2.patch: Added a shell script "run_sanity.sh" to run the existing test case (first checked in under KAFKA-440 ): system_test/replication_testsuite/testcase_1
          John Fung made changes -
          Attachment kafka-502-v3.patch [ 12547283 ]
          Hide
          Jun Rao added a comment -

          Thanks for patch v3. This looks great.

          0. Do all new tests pass right now? How long will it take to run all tests?

          1. The description in each test case can be improved. For example, in the following description, steps 06 and 07 are being done in every test case. It's ok to include the detailed steps in each test case. However, it would be useful to describe what's unique (say compared to testcase_1) about a particular test case at the very beginning of the description. This way, if some test cases fail, we can quickly understand the unique features about them. Also, can we spell out comp as compression?
          + "description": {"01":"To Test : 'Leader Failure in Replication'",
          + "02":"Produce and consume messages to a single topic - single partition.",
          + "03":"This test sends messages to 6 replicas",
          + "04":"To trigger leader election: find the leader and terminate by controlled failure (kill -15)",
          + "05":"Restart the terminated broker",
          + "06":"Lookup brokers' log4j messages and verify that leader is re-elected successfully",
          + "07":"At the end it verifies the log size and contents",
          + "08":"Use a consumer to verify no message loss.",
          + "09":"Producer dimensions : mode:async, acks:-1, comp:1",
          + "10":"Log segment size : 10240"

          2. run_sanity.sh seems to run in debug mode that logs a lot of debugging statements. Is this necessary? Also, in my local box, it has been running for more than 20 mins and hasn't finished. That test used to take less than 6 mins.

          3. Mirror maker testsuite: Do we bounce the leader in both the source and the target clusters? If so, do we bounce them at the same time? Just want to clarify how this works.

          Show
          Jun Rao added a comment - Thanks for patch v3. This looks great. 0. Do all new tests pass right now? How long will it take to run all tests? 1. The description in each test case can be improved. For example, in the following description, steps 06 and 07 are being done in every test case. It's ok to include the detailed steps in each test case. However, it would be useful to describe what's unique (say compared to testcase_1) about a particular test case at the very beginning of the description. This way, if some test cases fail, we can quickly understand the unique features about them. Also, can we spell out comp as compression? + "description": {"01":"To Test : 'Leader Failure in Replication'", + "02":"Produce and consume messages to a single topic - single partition.", + "03":"This test sends messages to 6 replicas", + "04":"To trigger leader election: find the leader and terminate by controlled failure (kill -15)", + "05":"Restart the terminated broker", + "06":"Lookup brokers' log4j messages and verify that leader is re-elected successfully", + "07":"At the end it verifies the log size and contents", + "08":"Use a consumer to verify no message loss.", + "09":"Producer dimensions : mode:async, acks:-1, comp:1", + "10":"Log segment size : 10240" 2. run_sanity.sh seems to run in debug mode that logs a lot of debugging statements. Is this necessary? Also, in my local box, it has been running for more than 20 mins and hasn't finished. That test used to take less than 6 mins. 3. Mirror maker testsuite: Do we bounce the leader in both the source and the target clusters? If so, do we bounce them at the same time? Just want to clarify how this works.
          Hide
          John Fung added a comment -

          Thank you Jun for reviewing the patch. Uploaded kafka-502-v4.patch with the changes suggested:

          1. The Test Description is now updated with additional info about the change from a previous test case.

          2. I couldn't reproduce the hanging problem you ran into. Please try this patch and see if it works for you.

          3. There is a basic Mirror Maker testcase in this patch but it doesn't introduce failure (bouncing). Failure cases will be added soon.

          Show
          John Fung added a comment - Thank you Jun for reviewing the patch. Uploaded kafka-502-v4.patch with the changes suggested: 1. The Test Description is now updated with additional info about the change from a previous test case. 2. I couldn't reproduce the hanging problem you ran into. Please try this patch and see if it works for you. 3. There is a basic Mirror Maker testcase in this patch but it doesn't introduce failure (bouncing). Failure cases will be added soon.
          John Fung made changes -
          Attachment kafka-502-v4.patch [ 12547846 ]
          Hide
          Jun Rao added a comment - - edited

          Thanks for patch v4. A comment and a couple of questions:

          40. In the test case description, sometimes we use mode => sync and some other times we use "mode is changed to sync". Let's make it consistent. I think mode => sync is better.

          41. Do all new tests pass right now? How long will it take to run all tests?

          42. run_sanity.sh does run now. It seems to take more than 10 minutes now, though it used to be less than 6 minutes. Is this expected?

          Show
          Jun Rao added a comment - - edited Thanks for patch v4. A comment and a couple of questions: 40. In the test case description, sometimes we use mode => sync and some other times we use "mode is changed to sync". Let's make it consistent. I think mode => sync is better. 41. Do all new tests pass right now? How long will it take to run all tests? 42. run_sanity.sh does run now. It seems to take more than 10 minutes now, though it used to be less than 6 minutes. Is this expected?
          John Fung made changes -
          Attachment kafka-502-v5.patch [ 12548003 ]
          Hide
          John Fung added a comment -

          Thanks Jun for reviewing. Uploaded kafka-v5.patch with the consistent test case descriptions: mode => sync

          40. testcase description : mode => sync (updated for all testcase_xxxx.json files)

          41. New cases status:
          testcase_0001 ~ 0010, testcase_0021 ~ 0023 (all passed)
          testcase_0101 ~ 0102 (failed)
          testcase_0103 (passed)
          testcase_0104 ~ 0110 (failed)
          testcase_0111 ~ 0114 (failed)
          testcase_0115 ~ 0116 (passed)
          testcase_0117 ~ 0118 (failed)
          testcase_0121 ~ 0123 (failed)

          42. run_sanity is running the first existing testcase_1. Originally it is testing a 3-broker cluster. Now is changed to a 6-broker cluster and that's why it takes longer to wait for the cluster to be ready.

          Show
          John Fung added a comment - Thanks Jun for reviewing. Uploaded kafka-v5.patch with the consistent test case descriptions: mode => sync 40. testcase description : mode => sync (updated for all testcase_xxxx.json files) 41. New cases status: testcase_0001 ~ 0010, testcase_0021 ~ 0023 (all passed) testcase_0101 ~ 0102 (failed) testcase_0103 (passed) testcase_0104 ~ 0110 (failed) testcase_0111 ~ 0114 (failed) testcase_0115 ~ 0116 (passed) testcase_0117 ~ 0118 (failed) testcase_0121 ~ 0123 (failed) 42. run_sanity is running the first existing testcase_1. Originally it is testing a 3-broker cluster. Now is changed to a 6-broker cluster and that's why it takes longer to wait for the cluster to be ready.
          John Fung made changes -
          Attachment kafka-502-v6.patch [ 12548031 ]
          Hide
          John Fung added a comment -

          Uploaded kafka-v6.patch with the following change:

          51. Added a new cluster_config.json to replication_testsuite/testcase_1/ for 3-broker cluster testing. This is the testcase to run under "run_sanity.sh". Therefore, the sanity test will take around 6 min to finish.

          Thanks.

          Show
          John Fung added a comment - Uploaded kafka-v6.patch with the following change: 51. Added a new cluster_config.json to replication_testsuite/testcase_1/ for 3-broker cluster testing. This is the testcase to run under "run_sanity.sh". Therefore, the sanity test will take around 6 min to finish. Thanks.
          Hide
          Jun Rao added a comment -

          Thanks for patch v6. A few more comments:

          60. Most of the new test cases use replication factor 6. I am not sure if this is necessary. The typical replication factor is 3. While we should cover the case with higher replication factors, most tests should probably be using replication factor 3.

          61. The description of the changes in some of the test cases is not complete. The following are some examples:
          0003 and up: missed the change comp => 1
          0005: missed the change mode => async
          0006: incorrectly marked the change mode=>sync; the test seems to be the same as 0001
          0008: incorrectly marked the change mode=>sync;
          0009: missed the change acks => 1

          Show
          Jun Rao added a comment - Thanks for patch v6. A few more comments: 60. Most of the new test cases use replication factor 6. I am not sure if this is necessary. The typical replication factor is 3. While we should cover the case with higher replication factors, most tests should probably be using replication factor 3. 61. The description of the changes in some of the test cases is not complete. The following are some examples: 0003 and up: missed the change comp => 1 0005: missed the change mode => async 0006: incorrectly marked the change mode=>sync; the test seems to be the same as 0001 0008: incorrectly marked the change mode=>sync; 0009: missed the change acks => 1
          John Fung made changes -
          Attachment kafka-502-v7.patch [ 12548266 ]
          Hide
          John Fung added a comment - - edited

          Thanks Jun for reviewing. Uploaded kafka-502-v7.patch with the following changes:

          60. Now all cases use replication factor 3 except testcases_002[1-3] and testcases_012[1-3] (which use replication factor 6)

          61. All cases description are now updated with the changes specified in the first description line relative to their corresponding "Base Test".

          71. All of the test cases are also updated with no. of messages changed from 500 => 100 per ProducerPerformance call such that the test will take less time to run.

          72. In testcase_011[1-8], I have reduced the no. of leader bouncing from 5 => 3 to reduce the time to run the tests.

          Show
          John Fung added a comment - - edited Thanks Jun for reviewing. Uploaded kafka-502-v7.patch with the following changes: 60. Now all cases use replication factor 3 except testcases_002 [1-3] and testcases_012 [1-3] (which use replication factor 6) 61. All cases description are now updated with the changes specified in the first description line relative to their corresponding "Base Test". 71. All of the test cases are also updated with no. of messages changed from 500 => 100 per ProducerPerformance call such that the test will take less time to run. 72. In testcase_011 [1-8] , I have reduced the no. of leader bouncing from 5 => 3 to reduce the time to run the tests.
          John Fung made changes -
          Attachment kafka-502-v8.patch [ 12548330 ]
          Hide
          John Fung added a comment -

          Uploaded kafka-502-v8.patch to change the following cases to use replication factor 3 to reduce the overall running time:
          testcase_002[1~3]
          testcase_012[1~3]

          At this point, all testcases are using 3-broker cluster for testing.

          Show
          John Fung added a comment - Uploaded kafka-502-v8.patch to change the following cases to use replication factor 3 to reduce the overall running time: testcase_002 [1~3] testcase_012 [1~3] At this point, all testcases are using 3-broker cluster for testing.
          John Fung made changes -
          Attachment kafka-502-v9.patch [ 12548441 ]
          Hide
          John Fung added a comment -

          Uploaded kafka-502-v9.patch for the following change:

          81. As the data log segment naming style has changed from *.kafka => *.log & *.index, system test's log file checksum validation is also needed to change.

          Show
          John Fung added a comment - Uploaded kafka-502-v9.patch for the following change: 81. As the data log segment naming style has changed from *.kafka => *.log & *.index, system test's log file checksum validation is also needed to change.
          Hide
          Jay Kreps added a comment -

          Sorry about that, should have given a heads up on that change. Didn't think about the impact on the system test.

          Show
          Jay Kreps added a comment - Sorry about that, should have given a heads up on that change. Didn't think about the impact on the system test.
          John Fung made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              John Fung
              Reporter:
              John Fung
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development