[MESOS-3280] Master fails to access replicated log after network partition - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.23.0
Fix Version/s: 0.26.0
Component/s: master, replicated log
Labels:
- mesosphere
Environment: ZooKeeper version 3.4.5--1
Story Points: 8

Description

In a 5-node cluster with 3 masters and 2 slaves, and ZooKeeper (ZK) running on each node, forcing a network partition apparently causes all the masters to lose access to their replicated log. The leading master halts; the reason is unknown, but it is presumably related to replicated log access. The other masters fail to recover from the replicated log, also for unknown reasons. This could be caused by the ZK setup, but it might also be a Mesos bug.

This was observed in a Chronos test drive scenario described in detail here:
https://github.com/mesos/chronos/issues/511

Setup instructions are here:
https://github.com/mesos/chronos/issues/508
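
For context, here is a minimal sketch of majority-quorum arithmetic for the 3-master setup described above. It assumes the replicated log needs a majority of master replicas to be reachable; the function names are hypothetical and are not part of Mesos. It does not explain the failure reported here. It only shows which side of a partition would normally be expected to keep access to the log.

```python
def majority_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def can_reach_quorum(reachable: int, replicas: int) -> bool:
    """True if a node that can reach `reachable` replicas (itself included)
    is on a side of the partition that still holds a majority."""
    return reachable >= majority_quorum(replicas)


if __name__ == "__main__":
    masters = 3  # cluster size from the report above
    for reachable in (3, 2, 1):
        print(f"reachable={reachable} quorum_ok={can_reach_quorum(reachable, masters)}")
```

Under this assumption, a master isolated on its own cannot reach a quorum, while two masters that can still see each other can. The report says all three masters lost access, which is why a bug is suspected.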

Attachments

- rep-log-race-cond-logs.tar.gz, 27/Sep/15 00:12, 20 kB, Neil Conway
- rep-log-startup-race-test-1.patch, 27/Sep/15 00:11, 4 kB, Neil Conway

Issue Links

is related to: MESOS-3532 3 Master HA setup restarts every 3 minutes (Resolved)

relates to: MESOS-1399 Add retries for co-ordinator election. (Accepted)

People

Assignee: Neil Conway
Reporter: Bernd Mathiske
Shepherd: Joris Van Remoortere
Votes: 0
Watchers: 9

Dates

Created: 17/Aug/15 16:11
Updated: 26/Nov/18 12:20
Resolved: 26/Nov/18 12:20