  1. Apache Ozone
  2. HDDS-7823 SCM HA Phase 2
  3. HDDS-4237

Testing Infrastructure Random Failures

Details

    • Sub-task
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      Network partitioning can cause a split-brain situation in which two leaders exist at the same time. We need a testing infrastructure or framework that simulates this case and verifies whether our SCM HA implementation achieves strong consistency under a partitioned network.

      Mukul Kumar Singh suggested two possible approaches:

      a) Blockade tests: Blockade is a Docker-based framework in which the
      network of one DN (datanode) can be isolated from the others.

      b) MiniOzoneChaosCluster: a unit-test-based approach in which a
      random datanode is killed; this has helped uncover consistency
      issues.
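
The chaos-style approach in (b) can be pictured as a loop that injects a fault, drives some load, recovers, and then checks an invariant. The following is only a minimal sketch of that loop shape on an in-memory toy model; it does not use MiniOzoneChaosCluster or Blockade, and the names `run_chaos_rounds` and `logs_consistent` are hypothetical.

```python
import random


def run_chaos_rounds(node_names, rounds, seed=0):
    """Toy chaos loop: kill one random node per round, write, then restart it.

    This is an illustrative model only, not Ozone code.
    """
    if len(node_names) < 3:
        raise ValueError("need at least 3 nodes so a majority survives one kill")
    rng = random.Random(seed)            # seeded so a failing run can be replayed
    logs = {name: [] for name in node_names}
    quorum = len(node_names) // 2 + 1
    history = []
    for i in range(rounds):
        victim = rng.choice(node_names)  # fault injection: pick a node to kill
        alive = [n for n in node_names if n != victim]
        entry = f"entry-{i}"
        if len(alive) >= quorum:         # a write needs a majority of nodes
            for n in alive:
                logs[n].append(entry)
        # Recovery: the restarted node catches up from a surviving node.
        logs[victim] = list(logs[alive[0]])
        history.append((victim, entry))
    return logs, history


def logs_consistent(logs):
    """Invariant: every node ends up with the same sequence of entries."""
    reference = next(iter(logs.values()))
    return all(log == reference for log in logs.values())


logs, history = run_chaos_rounds(["dn1", "dn2", "dn3"], rounds=5, seed=42)
print(logs_consistent(logs))
```

A real test would replace the in-memory dictionaries with calls against a running cluster, but the structure (seeded randomness, fault, load, recovery, invariant check) would likely stay the same.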

      We might need a similar solution for SCM: block the SCM leader's network and also increase the timeout so that the old leader does not turn into a candidate.
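
To make the scenario concrete, here is a small sketch of what such a test would assert. It is a toy majority-quorum model, not Ratis or SCM code, and every name in it (`ToyCluster`, `elect`, `write`, the `scm1`..`scm3` node names) is an assumption made for illustration. The increased timeout is modeled simply by the isolated leader never stepping down.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    term: int = 0
    is_leader: bool = False
    log: list = field(default_factory=list)  # committed entries only


class ToyCluster:
    """Toy leader-based quorum with a single network partition."""

    def __init__(self, names):
        self.nodes = {name: Node(name) for name in names}
        self.isolated = set()  # nodes cut off from the rest

    def _reachable_from(self, name):
        # Nodes on the same side of the partition (including the node itself).
        same_side = name in self.isolated
        return [n for n in self.nodes if (n in self.isolated) == same_side]

    def _has_majority(self, members):
        return 2 * len(members) > len(self.nodes)

    def elect(self, candidate):
        voters = self._reachable_from(candidate)
        if not self._has_majority(voters):
            return False
        new_term = max(self.nodes[n].term for n in voters) + 1
        for n in voters:
            self.nodes[n].term = new_term
            self.nodes[n].is_leader = (n == candidate)
        return True

    def write(self, leader, entry):
        node = self.nodes[leader]
        if not node.is_leader:
            return False
        # Followers with a newer term would reject a stale leader.
        ackers = [n for n in self._reachable_from(leader)
                  if self.nodes[n].term <= node.term]
        if not self._has_majority(ackers):
            return False  # cannot commit without a majority
        for n in ackers:
            self.nodes[n].log.append(entry)
        return True

    def leaders(self):
        return sorted(n.name for n in self.nodes.values() if n.is_leader)


cluster = ToyCluster(["scm1", "scm2", "scm3"])
cluster.elect("scm1")
cluster.write("scm1", "a")
cluster.isolated.add("scm1")                  # block the leader's network
cluster.elect("scm2")                         # the majority side elects a new leader
stale_write_ok = cluster.write("scm1", "b")   # old leader still thinks it leads
fresh_write_ok = cluster.write("scm2", "c")
print(cluster.leaders(), stale_write_ok, fresh_write_ok)
```

In this model two nodes consider themselves leader after the partition, but only the one backed by a majority can commit, which is the property a split-brain test for SCM HA would need to verify.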

            People

              Assignee: Unassigned
              Reporter: Rui Wang (amaliujia)