We need a nice way of handling long network partitions without impacting a master cluster (which pushes the data). Currently it will just retry over and over again.
I think we could:
- Stop replication to a slave cluster if it didn't respond for more than 10 minutes
- Keep track of the duration of the partition
- When the slave cluster comes back, initiate a MR job like
Maybe we want less than 10 minutes, maybe we want this to be all automatic or just the first 2 parts. Discuss.