HBase / HBASE-26768

Avoid unnecessary replication suspending in RegionReplicationSink

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha-2
    • Fix Version/s: 3.0.0-alpha-3
    • Component/s: read replicas
    • Labels: None

    Description

      The problem described in HBASE-26449 still seems to exist in RegionReplicationSink.onComplete (shown below), which runs in Netty's nioEventLoop. Assume there is only one secondary replica. First, because replicating to it failed, onComplete adds the replica to the local failed set at line 228. Before onComplete reaches line 238, the flusher thread calls RegionReplicationSink.add, which clears RegionReplicationSink.failedReplicas because the edit is a flush-all edit. When the Netty nioEventLoop then continues into line 238, it still adds the replica to failedReplicas, even though maxSequenceId < lastFlushedSequenceId by that point.

      207  private void onComplete(List<SinkEntry> sent,
      208      Map<Integer, MutableObject<Throwable>> replica2Error) {
             ....
      217    Set<Integer> failed = new HashSet<>();
      218    for (Map.Entry<Integer, MutableObject<Throwable>> entry : replica2Error.entrySet()) {
      219      Integer replicaId = entry.getKey();
      220      Throwable error = entry.getValue().getValue();
      221      if (error != null) {
      222        if (maxSequenceId > lastFlushedSequenceId) {
                   ...
      228          failed.add(replicaId);
      229        } else {
                   ......

      238    synchronized (entries) {
      239      pendingSize -= toReleaseSize;
      240      if (!failed.isEmpty()) {
      241        failedReplicas.addAll(failed);
      242        flushRequester.requestFlush(maxSequenceId);
      243      }
               ......
      253    }
      254  }
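
      The interleaving described above can be replayed deterministically on a single thread. The following is a minimal sketch against a simplified model, not the HBase class: the class name, method names, and sequence ids are hypothetical, and it assumes that the flush-all edit handled in add both advances lastFlushedSequenceId and clears failedReplicas.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of the state involved in the race; not the HBase class.
final class StaleFailureRaceDemo {
  private final Object lock = new Object();
  private final Set<Integer> failedReplicas = new HashSet<>();
  private volatile long lastFlushedSequenceId;

  // Models lines 217-228: decide, outside the lock, whether the replica counts as failed.
  Set<Integer> collectFailed(int replicaId, long maxSequenceId) {
    Set<Integer> failed = new HashSet<>();
    if (maxSequenceId > lastFlushedSequenceId) {
      failed.add(replicaId);
    }
    return failed;
  }

  // Models the flusher thread calling add() with a flush-all edit (assumed behavior).
  void onFlushAllEdit(long flushSequenceId) {
    synchronized (lock) {
      lastFlushedSequenceId = flushSequenceId;
      failedReplicas.clear();
    }
  }

  // Models lines 238-241: publish the earlier decision without re-checking it.
  void publishFailed(Set<Integer> failed) {
    synchronized (lock) {
      failedReplicas.addAll(failed);
    }
  }

  boolean isMarkedFailed(int replicaId) {
    synchronized (lock) {
      return failedReplicas.contains(replicaId);
    }
  }

  // Replays the problematic order of events; returns whether replica 1 ends up marked failed.
  static boolean replay() {
    StaleFailureRaceDemo sink = new StaleFailureRaceDemo();
    Set<Integer> failed = sink.collectFailed(1, 10L); // 10 > 0, so replica 1 is collected
    sink.onFlushAllEdit(20L);                          // the flush now covers sequence id 10
    sink.publishFailed(failed);                        // the stale decision is still applied
    return sink.isMarkedFailed(1);
  }

  public static void main(String[] args) {
    System.out.println("replica 1 marked failed after flush: " + replay());
  }
}
```

      In this model the replica stays in failedReplicas although the flush at sequence id 20 already covers the failed edits at sequence id 10.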
      

      Worse, the flush requested through RegionReplicationFlushRequester.requestFlush may then be skipped, because in RegionReplicationFlushRequester.flush (below) pendingFlushRequestSequenceId is less than lastFlushedSequenceId. The only secondary replica is therefore marked as failed while the requested flush is skipped, so replication may be suspended until the next memstore flush:

      private synchronized void flush(Timeout timeout) {
        pendingFlushRequest = null;
        if (pendingFlushRequestSequenceId >= lastFlushedSequenceId) {
          request();
        }
      }
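
      To make the skipped flush concrete, here is a small sketch that evaluates the same guard with example numbers. The class and method names are hypothetical, the sequence ids are made up, and it assumes that pendingFlushRequestSequenceId holds the maxSequenceId passed to requestFlush.

```java
// Hypothetical stand-in for the guard in flush(); not the HBase class.
final class FlushGuardDemo {
  // Mirrors the condition `pendingFlushRequestSequenceId >= lastFlushedSequenceId`.
  static boolean wouldRequestFlush(long pendingFlushRequestSequenceId,
      long lastFlushedSequenceId) {
    return pendingFlushRequestSequenceId >= lastFlushedSequenceId;
  }

  public static void main(String[] args) {
    // Stale request: the failed edits end at 10, but a flush at 20 has already happened.
    System.out.println(wouldRequestFlush(10L, 20L)); // prints "false": flush is skipped
    // Fresh request: the failed edits are newer than the last flush.
    System.out.println(wouldRequestFlush(30L, 20L)); // prints "true": flush is requested
  }
}
```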
      

      I simulated this problem in the PR. My fix is to double-check the condition if (maxSequenceId > lastFlushedSequenceId) inside the synchronized block in RegionReplicationSink.onComplete.
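
      The idea of the double check can be sketched as follows. This is an illustration against a simplified model, not the actual patch: the class, its methods, and the flushRequests counter are hypothetical, and it assumes lastFlushedSequenceId is only updated while holding the same lock that guards failedReplicas.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model of the fixed behavior; not the HBase class.
final class DoubleCheckedSinkSketch {
  private final Object lock = new Object();
  private final Set<Integer> failedReplicas = new HashSet<>();
  private long lastFlushedSequenceId; // guarded by lock in this model
  private int flushRequests;          // counts calls that would reach requestFlush

  // Models the flusher thread handling a flush-all edit (assumed behavior).
  void onFlushAllEdit(long flushSequenceId) {
    synchronized (lock) {
      lastFlushedSequenceId = flushSequenceId;
      failedReplicas.clear();
    }
  }

  // Re-checks the sequence id under the lock before publishing the failures.
  void publishFailed(Set<Integer> failed, long maxSequenceId) {
    synchronized (lock) {
      if (!failed.isEmpty() && maxSequenceId > lastFlushedSequenceId) {
        failedReplicas.addAll(failed);
        flushRequests++;
      }
    }
  }

  boolean isMarkedFailed(int replicaId) {
    synchronized (lock) {
      return failedReplicas.contains(replicaId);
    }
  }

  int flushRequests() {
    synchronized (lock) {
      return flushRequests;
    }
  }

  // A flush at 20 lands between collecting and publishing a failure at sequence id 10.
  static boolean replayWithFlush() {
    DoubleCheckedSinkSketch sink = new DoubleCheckedSinkSketch();
    Set<Integer> failed = Collections.singleton(1);
    sink.onFlushAllEdit(20L);
    sink.publishFailed(failed, 10L);
    return sink.isMarkedFailed(1);
  }

  // No flush in between: the failure is still recorded and a flush is requested.
  static boolean replayWithoutFlush() {
    DoubleCheckedSinkSketch sink = new DoubleCheckedSinkSketch();
    sink.publishFailed(Collections.singleton(1), 10L);
    return sink.isMarkedFailed(1) && sink.flushRequests() == 1;
  }

  public static void main(String[] args) {
    System.out.println("marked failed with intervening flush: " + replayWithFlush());
    System.out.println("marked failed without flush: " + replayWithoutFlush());
  }
}
```

      With the re-check under the lock, a stale failure is dropped when a flush has already covered it, while a genuine failure is still recorded and triggers a flush request.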

            People

              Assignee: comnetwork chenglei
              Reporter: comnetwork chenglei