  1. Mesos
  2. MESOS-3280

Master fails to access replicated log after network partition

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.26.0
    • Component/s: master, replicated log
    • Labels:
    • Environment:

      Zookeeper version 3.4.5--1

    • Story Points:
      8

      Description

      In a 5-node cluster with 3 masters and 2 slaves, and ZooKeeper (ZK) running on each node, forcing a network partition apparently causes all the masters to lose access to their replicated log. The leading master halts for unknown reasons, presumably related to replicated log access. The other masters fail to recover from the replicated log, also for unknown reasons. This could be caused by the ZK setup, but it might also be a Mesos bug.

      This was observed in a Chronos test drive scenario described in detail here:
      https://github.com/mesos/chronos/issues/511

      With setup instructions here:
      https://github.com/mesos/chronos/issues/508
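
      As a rough illustration of what a partition means for the replicated log in this setup, the sketch below works out which side of a split could still form a majority quorum. It assumes the usual majority rule for a 3-master cluster (a quorum of 2, as typically set through the master's --quorum flag). The issue does not state how the nodes were split, so the 2/1 split and all names in the code are hypothetical.

```python
# Sketch only: majority-quorum arithmetic for a replicated log.
# Assumption: quorum is a strict majority of masters (2 of 3 here).

def majority_quorum(total_replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return total_replicas // 2 + 1


def side_has_quorum(masters_on_side: int, total_masters: int) -> bool:
    """True if the masters on one side of a partition can form a quorum."""
    return masters_on_side >= majority_quorum(total_masters)


TOTAL_MASTERS = 3

# Hypothetical partition: the issue does not say how the nodes were split.
partition = {"side_a": 2, "side_b": 1}

status = {
    side: side_has_quorum(count, TOTAL_MASTERS)
    for side, count in partition.items()
}
print(status)
```

      Under these assumptions, the side holding two masters would be expected to keep a working quorum. The report instead describes all masters losing access, which may be part of why a Mesos bug is suspected.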

        Attachments

        1. rep-log-race-cond-logs.tar.gz
          20 kB
          Neil Conway
        2. rep-log-startup-race-test-1.patch
          4 kB
          Neil Conway

              People

              • Assignee:
                neilc Neil Conway
                Reporter:
                bernd-mesos Bernd Mathiske
                Shepherd:
                Joris Van Remoortere
              • Votes:
                0
                Watchers:
                10

                Dates

                • Created:
                  Updated:
                  Resolved: