Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      With the async wal approach (HBASE-11568), the edits are not persisted (to wal) in the secondary region replicas. However, this means that we have to deal with secondary region replica failures.

      We could re-replicate the edits from the primary to the secondary when the secondary region is opened on another server, but this would mean setting up a replication queue again and holding on to the wals for longer.

      Instead, we can design it so that the edits replayed to the secondaries are not persisted to the wal, and if the secondary replica fails over, it will not start serving reads until it has guaranteed that it has all the past data.

      For guaranteeing that the secondary replica has all the edits before serving reads, we can use flush and region opening markers. Whenever a region open event happens, the list of store files present at the time of opening is written to the wal (HBASE-11512). In case of a flush, the flushed file is written as well, and the secondary replica can do an ls for the store files and pick up all the files before the seqId of the flushed file. So, in this design, the secondary replica will wait until it sees and replays a flush or region open marker from the primary's wal, and then start serving. To speed up replica opening time, we can trigger a flush on the primary whenever a secondary replica opens, as an optimization.
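
      A minimal sketch of this gating logic, with hypothetical class and method names
      (not the actual HBase implementation), assuming the replayed WAL entries carry an
      event type, a seqId, and the list of store files for flush/region-open markers:

        import java.util.List;

        // Sketch: a secondary replica rejects reads until it has replayed a
        // region-open or flush marker from the primary's WAL.
        public class SecondaryReadGate {

          enum EventType { REGION_OPEN, FLUSH, EDIT }

          interface ReplayedEntry {
            EventType type();
            long seqId();
            List<String> storeFiles();   // populated for REGION_OPEN / FLUSH markers
          }

          private volatile boolean serving = false;

          /** Called for every entry replicated from the primary region. */
          public void onReplay(ReplayedEntry entry) {
            switch (entry.type()) {
              case REGION_OPEN:
              case FLUSH:
                // The marker names the store files written up to entry.seqId();
                // once they are picked up, the replica has all persisted data
                // and can start serving reads.
                pickUpStoreFiles(entry.storeFiles());
                serving = true;
                break;
              case EDIT:
                if (serving) {
                  applyToMemstore(entry);
                }
                break;
            }
          }

          /** Reads are rejected until a flush or region-open marker has been replayed. */
          public boolean canServeReads() { return serving; }

          private void pickUpStoreFiles(List<String> files) { /* refresh store file list */ }
          private void applyToMemstore(ReplayedEntry entry) { /* apply edit to memstore */ }
        }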


          Activity

          Enis Soztutar added a comment -

          I have put up a patch for review for this. The RB description goes into some detail about what the patch contains.

          Enis Soztutar added a comment -

          Would this be on by default or would this be an option? It seems that in a failure situation we would hit a "flush amplification" problem which seems concerning

          I think we may want to turn this on by default, but we'll have to get some operational experience with the setting. You are right that there is a sync flush concern, but without making secondary replica edits durable in the wal, there seems to be no way other than waiting for a flush before serving data from the memstore.

          We can also add some jitter to cope with this.
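
          A small sketch of what that jitter could look like on the side that triggers the
          flush (the delay bound and the flush RPC are made up for illustration; they are
          not existing HBase settings or APIs):

            import java.util.concurrent.Executors;
            import java.util.concurrent.ScheduledExecutorService;
            import java.util.concurrent.ThreadLocalRandom;
            import java.util.concurrent.TimeUnit;

            // Delay each secondary's flush-trigger RPC by a random amount so a mass
            // re-open does not ask every primary to flush at the same instant.
            public class JitteredFlushTrigger {
              private final ScheduledExecutorService scheduler =
                  Executors.newSingleThreadScheduledExecutor();
              private final long maxJitterMs = 30_000L;   // illustrative bound only

              /** flushRpc is whatever call actually asks the primary's RS to flush. */
              public void trigger(Runnable flushRpc) {
                long jitterMs = ThreadLocalRandom.current().nextLong(maxJitterMs);
                scheduler.schedule(flushRpc, jitterMs, TimeUnit.MILLISECONDS);
              }
            }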

          Jonathan Hsieh added a comment -

          As an optimization so that the secondary will start serving data sooner, the region server opening a secondary region replica will make an RPC to the RS serving the primary for a flush. This will result in either a flush, which will asynchronously be replayed to the secondary, or a no-op because the memstores are empty. In the latter case, we should still drop a WAL marker though.

          Would this be on by default or would this be an option? It seems that in a failure situation we would hit a "flush amplification" problem which seems concerning.

          Let's say we have a 20 node cluster with each rs having 25 regions on it. If those regions have 2 secondaries, each host would have 50 secondaries. Let's say one rs goes down. When these secondaries get rebalanced, they'd get spread around, likely landing on every or almost every node in the cluster. If each secondary region reopening triggers a flush request, we'd end up with a lot of flush requests, all around the same time, all across the cluster, no? This seems problematic. We could probably coalesce some of the flushes (each rs receiving secondaries could consolidate all into one flush), but if we had 5 machines go down wouldn't we still get 5x the flushes?
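
          To put rough numbers on this scenario (using the figures above):

            25 regions/RS x 2 secondaries = 50 secondary replicas hosted per RS
            1 RS failure  -> ~50 secondaries re-opened -> ~50 flush requests, clustered in time
            5 RS failures -> ~250 flush requests at roughly the same time, i.e. about 5x the flush load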

          Enis Soztutar added a comment -

          Sorry, I must have missed that one.

          The reason we need special handling in secondary region replicas is that the edits are replayed to secondaries, but the WAL is skipped in secondary replicas. This is because we do not want to cause additional WAL writes for secondaries.

          However, this implies that when the secondary region replica starts serving, it cannot be sure that it has all the memstore updates, because some of them may have been missed after the last flush. So, what I am thinking is a mechanism with two parts:

          • Whenever a secondary is opened, it will reject read requests until it makes sure (on its own) that it has all the updates up to a seqId. The seqId has to be greater than the last flush seqId of the files, since a previous open of the same region replica might have replayed some in-memory data, and we do not want the seqId served from a particular replica_id to go back. In this state, the secondary will watch for flush start and commit events from WAL replay. The secondary will ignore all replayed edits before seeing a flush start record. After seeing flush start, it will apply all edits to memstores. Once the secondary sees the corresponding flush commit, it can pick up the new flushed file, and start serving from all flushed files and its memstore data.
          • As an optimization so that the secondary will start serving data sooner, the region server opening a secondary region replica will make an RPC to the RS serving the primary for a flush. This will result in either a flush, which will asynchronously be replayed to the secondary, or a no-op because the memstores are empty. In the latter case, we should still drop a WAL marker though. (A rough sketch of both parts follows after this list.)
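
          A rough sketch of the two parts, with illustrative names only (not the actual
          HBase classes); "lastServedSeqId" stands for the highest seqId this replica_id
          may already have served before the re-open:

            // Part 1: gate reads on replayed flush start/commit events.
            // Part 2: on open, ask the primary's RS to flush.
            public class SecondaryCatchUp {

              enum Type { FLUSH_START, FLUSH_COMMIT, EDIT }

              static final class Entry {
                final Type type;
                final long seqId;
                Entry(Type type, long seqId) { this.type = type; this.seqId = seqId; }
              }

              /** Hypothetical RPC handle to the RS hosting the primary replica. */
              interface PrimaryRpc { void requestFlush(); }

              private final long lastServedSeqId;   // served seqId must never go back
              private boolean flushStarted = false;
              private volatile boolean serving = false;

              SecondaryCatchUp(long lastServedSeqId) {
                this.lastServedSeqId = lastServedSeqId;
              }

              /** Part 2: trigger a flush on open; a no-op flush should still drop a WAL marker. */
              void onOpen(PrimaryRpc primary) {
                primary.requestFlush();
              }

              /** Part 1: called for every entry replayed from the primary's WAL. */
              void onReplay(Entry e) {
                switch (e.type) {
                  case FLUSH_START:
                    flushStarted = true;
                    break;
                  case FLUSH_COMMIT:
                    // Serve only if this flush covers everything previously served, so the
                    // seqId visible from this replica_id never goes backwards.
                    if (flushStarted && e.seqId >= lastServedSeqId) {
                      pickUpFlushedFile(e.seqId);
                      serving = true;
                    }
                    break;
                  case EDIT:
                    // Edits before the flush start are ignored; afterwards they go to the
                    // memstore so nothing after the flush is lost.
                    if (flushStarted) {
                      applyToMemstore(e);
                    }
                    break;
                }
              }

              boolean canServeReads() { return serving; }

              private void pickUpFlushedFile(long seqId) { /* add the new flushed store file */ }
              private void applyToMemstore(Entry e) { /* apply edit to memstore */ }
            }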
          Jonathan Hsieh added a comment -

          This wasn't answered in HBASE-11183

          enis: Whenever a secondary starts serving, it will trigger a flush from the primary region ..
          stack: How will this work?

          Do you have a sense of what the policy will be? Can you fill in more details about how this would work?


            People

            • Assignee:
              Enis Soztutar
              Reporter:
              Enis Soztutar
            • Votes:
              0
              Watchers:
              4
