[HBASE-26768] Avoid unnecessary replication suspending in RegionReplicationSink - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha-2
Fix Version/s: 3.0.0-alpha-3
Component/s: read replicas
Labels:
None

Description

It seems that the problem described in HBASE-26449 still exists in the following RegionReplicationSink.onComplete, which runs in Netty's nioEventLoop. Assume we have only one secondary replica. First, at line 228 below, we add the replica to the local failed set because replicating to it failed. But before we enter the synchronized block at line 238, the flusher thread calls RegionReplicationSink.add, and we clear RegionReplicationSink.failedReplicas because of a flush-all edit. When the Netty nioEventLoop then continues into line 238, we still add the replica to failedReplicas, even though by now maxSequenceId < lastFlushedSequenceId.

207  private void onComplete(List<SinkEntry> sent,
208      Map<Integer, MutableObject<Throwable>> replica2Error) {
         ....
217    Set<Integer> failed = new HashSet<>();
218    for (Map.Entry<Integer, MutableObject<Throwable>> entry : replica2Error.entrySet()) {
219      Integer replicaId = entry.getKey();
220      Throwable error = entry.getValue().getValue();
221      if (error != null) {
222        if (maxSequenceId > lastFlushedSequenceId) {
             ...
228          failed.add(replicaId);
229        } else {
             ......

238    synchronized (entries) {
239      pendingSize -= toReleaseSize;
240      if (!failed.isEmpty()) {
241        failedReplicas.addAll(failed);
242        flushRequester.requestFlush(maxSequenceId);
243      }
         ......
253    }
254  }

What is worse, when we invoke RegionReplicationFlushRequester.requestFlush, the flush may be skipped, because in the following RegionReplicationFlushRequester.flush, pendingFlushRequestSequenceId is less than lastFlushedSequenceId. So the only secondary replica is marked as failed while the requested flush is skipped, and replication may stay suspended until the next memstore flush:

private synchronized void flush(Timeout timeout) {
  pendingFlushRequest = null;
  if (pendingFlushRequestSequenceId >= lastFlushedSequenceId) {
    request();
  }
}
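The effect of that guard can be illustrated with a small standalone model. This is only a sketch under assumptions: FlushRequesterSketch, recordFlushedSequenceId, and issuedRequests are hypothetical names, the timer and the real request() call are replaced by a counter, and none of this is the actual HBase class.

```java
// Hypothetical, simplified model of the sequence-id guard described above.
// It is not RegionReplicationFlushRequester; the timer is omitted and
// request() is replaced by a counter.
final class FlushRequesterSketch {
    private long lastFlushedSequenceId;
    private long pendingFlushRequestSequenceId;
    private int issuedRequests;

    // Called when a memstore flush with the given sequence id has completed.
    synchronized void recordFlushedSequenceId(long sequenceId) {
        lastFlushedSequenceId = Math.max(lastFlushedSequenceId, sequenceId);
    }

    // Returns true if a flush request would actually be issued.
    synchronized boolean requestFlush(long sequenceId) {
        pendingFlushRequestSequenceId = sequenceId;
        // Same comparison as in the issue: a request older than the last
        // flush is silently dropped.
        if (pendingFlushRequestSequenceId >= lastFlushedSequenceId) {
            issuedRequests++;
            return true;
        }
        return false;
    }

    synchronized int issuedRequests() {
        return issuedRequests;
    }

    public static void main(String[] args) {
        FlushRequesterSketch requester = new FlushRequesterSketch();
        requester.recordFlushedSequenceId(20L);
        System.out.println("stale request issued: " + requester.requestFlush(10L));
        System.out.println("fresh request issued: " + requester.requestFlush(30L));
    }
}
```

In this model, a request for sequence id 10 arriving after a flush at sequence id 20 is dropped, which corresponds to the skipped flush described above.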

I simulate this problem in the PR, and my fix is to double-check the condition if (maxSequenceId > lastFlushedSequenceId) inside the synchronized block in RegionReplicationSink.onComplete.
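The interleaving and the double-check can be replayed deterministically on a single thread. The following is a minimal sketch, not the code from the PR: SinkRaceSketch and its methods collectFailed, onFlushAll, commit, and replay are hypothetical names that model only the two phases of onComplete and the flush-all path of add.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the race; not RegionReplicationSink itself.
final class SinkRaceSketch {
    private final Object lock = new Object();
    private final Set<Integer> failedReplicas = new HashSet<>();
    private volatile long lastFlushedSequenceId;
    private final boolean doubleCheck;

    SinkRaceSketch(boolean doubleCheck) {
        this.doubleCheck = doubleCheck;
    }

    // Phase 1 of onComplete: runs without the lock (around line 222-228).
    Set<Integer> collectFailed(int replicaId, long maxSequenceId) {
        Set<Integer> failed = new HashSet<>();
        if (maxSequenceId > lastFlushedSequenceId) {
            failed.add(replicaId);
        }
        return failed;
    }

    // Flusher thread: a flush-all edit advances the flushed id and clears failures.
    void onFlushAll(long flushSequenceId) {
        synchronized (lock) {
            lastFlushedSequenceId = flushSequenceId;
            failedReplicas.clear();
        }
    }

    // Phase 2 of onComplete: runs under the lock (around line 238-243).
    void commit(Set<Integer> failed, long maxSequenceId) {
        synchronized (lock) {
            if (failed.isEmpty()) {
                return;
            }
            // The double-check: re-evaluate the condition now that we hold the lock.
            if (doubleCheck && maxSequenceId <= lastFlushedSequenceId) {
                return;
            }
            failedReplicas.addAll(failed);
        }
    }

    Set<Integer> failedReplicasSnapshot() {
        synchronized (lock) {
            return new HashSet<>(failedReplicas);
        }
    }

    // Replays the problematic order: phase 1, then flush-all, then phase 2.
    static Set<Integer> replay(boolean doubleCheck) {
        SinkRaceSketch sink = new SinkRaceSketch(doubleCheck);
        Set<Integer> failed = sink.collectFailed(1, 10L);
        sink.onFlushAll(20L);
        sink.commit(failed, 10L);
        return sink.failedReplicasSnapshot();
    }

    public static void main(String[] args) {
        System.out.println("without double-check: " + replay(false));
        System.out.println("with double-check: " + replay(true));
    }
}
```

Without the double-check, replica 1 ends up in failedReplicas even though the flush at sequence id 20 already covers the failed edits; with it, the set stays empty and no stale flush request is needed.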

Issue Links

relates to

HBASE-26449 The way we add or clear failedReplicas may have race

Resolved

HBASE-26233 The region replication framework should not be built upon the general replication framework

Resolved

links to

GitHub Pull Request #4127

People

Assignee: chenglei

Reporter: chenglei

Votes: 0

Watchers: 4

Dates

Created: 23/Feb/22 13:17

Updated: 30/Apr/22 03:52

Resolved: 10/Mar/22 03:01