Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 3.0.0-alpha-2
- None
Description
It seems that the problem described in HBASE-26449 still exists in the following RegionReplicationSink.onComplete, which runs in Netty's nioEventLoop. Assume we have only one secondary replica. First, we add the replica to the local failed set at line 228 because replicating to it failed. But before we enter the synchronized block at line 238, the flusher thread calls RegionReplicationSink.add and clears RegionReplicationSink.failedReplicas because of a flush-all edit. When the Netty nioEventLoop then continues into line 238, we still add the replica to failedReplicas, even though maxSequenceId < lastFlushedSequenceId now holds.
207 private void onComplete(List<SinkEntry> sent,
208     Map<Integer, MutableObject<Throwable>> replica2Error) {
....
217   Set<Integer> failed = new HashSet<>();
218   for (Map.Entry<Integer, MutableObject<Throwable>> entry : replica2Error.entrySet()) {
219     Integer replicaId = entry.getKey();
220     Throwable error = entry.getValue().getValue();
221     if (error != null) {
222       if (maxSequenceId > lastFlushedSequenceId) {
...
228         failed.add(replicaId);
229       } else {
......
238   synchronized (entries) {
239     pendingSize -= toReleaseSize;
240     if (!failed.isEmpty()) {
241       failedReplicas.addAll(failed);
242       flushRequester.requestFlush(maxSequenceId);
243     }
......
253   }
254 }
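To make the interleaving concrete, here is a minimal single-threaded replay of the three steps described above. It is only a sketch: the class StaleFailureReplay, the sequence ids 10, 15 and 20, and the replica id 1 are made up for illustration, and the fields are simplified stand-ins for the RegionReplicationSink fields, not the real implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model, not HBase code: replays the race deterministically.
class StaleFailureReplay {
  // Simplified stand-ins for the RegionReplicationSink fields named in the report.
  static final Set<Integer> failedReplicas = new HashSet<>();
  static long lastFlushedSequenceId;

  /** Replays the interleaving step by step and returns the final failedReplicas. */
  static Set<Integer> replay() {
    failedReplicas.clear();
    lastFlushedSequenceId = 10;   // assumed starting point
    long maxSequenceId = 15;      // assumed max sequence id of the failed batch
    int replicaId = 1;            // the only secondary replica

    // Step 1 (event loop, outside the lock): lines 222-228.
    Set<Integer> failed = new HashSet<>();
    if (maxSequenceId > lastFlushedSequenceId) {
      failed.add(replicaId);
    }

    // Step 2 (flusher thread): add() sees a flush-all edit with sequence id 20
    // and clears failedReplicas.
    lastFlushedSequenceId = 20;
    failedReplicas.clear();

    // Step 3 (event loop, inside the synchronized block): lines 240-241.
    // The condition from step 1 is not re-evaluated here.
    if (!failed.isEmpty()) {
      failedReplicas.addAll(failed);
    }
    return new HashSet<>(failedReplicas);
  }

  public static void main(String[] args) {
    System.out.println("failedReplicas after replay: " + replay());
  }
}
```

In this replay the replica ends up in failedReplicas although the failed batch (sequence id 15) is already covered by the flush at sequence id 20.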
What is worse, when we invoke RegionReplicationFlushRequester.requestFlush, the flush may be skipped, because in the following RegionReplicationFlushRequester.flush, pendingFlushRequestSequenceId is less than lastFlushedSequenceId. The only secondary replica is then marked as failed while the requested flush is skipped, so replication may be suspended until the next memstore flush:
private synchronized void flush(Timeout timeout) {
  pendingFlushRequest = null;
  if (pendingFlushRequestSequenceId >= lastFlushedSequenceId) {
    request();
  }
}
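The effect of this guard on a stale request can be sketched as follows. This is a simplified model under stated assumptions: FlushGuardSketch and requestsAfter are hypothetical names, the timer-based delay of the real requester is omitted, and requestFlush is assumed to simply record the requested sequence id.

```java
// Hypothetical model, not HBase code: only the guard from flush() is mirrored.
class FlushGuardSketch {
  private final long lastFlushedSequenceId;
  private long pendingFlushRequestSequenceId;
  private int requestsSent;

  FlushGuardSketch(long lastFlushedSequenceId) {
    this.lastFlushedSequenceId = lastFlushedSequenceId;
  }

  // Assumption: the request just remembers the sequence id it was asked for.
  synchronized void requestFlush(long sequenceId) {
    pendingFlushRequestSequenceId = sequenceId;
  }

  // Same condition as in the snippet above; request() is modeled as a counter.
  synchronized void flush() {
    if (pendingFlushRequestSequenceId >= lastFlushedSequenceId) {
      requestsSent++;
    }
  }

  /** Returns how many flush requests were actually sent for one requestFlush call. */
  static int requestsAfter(long lastFlushed, long requested) {
    FlushGuardSketch requester = new FlushGuardSketch(lastFlushed);
    requester.requestFlush(requested);
    requester.flush();
    return requester.requestsSent;
  }

  public static void main(String[] args) {
    System.out.println("stale request sent: " + requestsAfter(20, 15));
    System.out.println("fresh request sent: " + requestsAfter(20, 25));
  }
}
```

With a last flush at sequence id 20, a request for sequence id 15 sends nothing, which is the skipped flush described above.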
I simulated this problem in the PR. My fix is to double-check the condition if (maxSequenceId > lastFlushedSequenceId) inside the synchronized block in RegionReplicationSink.onComplete.
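The double check described above might look roughly like this. It is a minimal sketch, not the code from the PR: DoubleCheckSketch, onFlushAll and recordFailures are hypothetical names, the lock object stands in for entries, and the sketch assumes the flusher side updates lastFlushedSequenceId and clears failedReplicas under that same lock.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model, not HBase code: re-checks the condition under the lock.
class DoubleCheckSketch {
  private final Object lock = new Object();   // stands in for `entries`
  private final Set<Integer> failedReplicas = new HashSet<>();
  private long lastFlushedSequenceId;

  DoubleCheckSketch(long initialFlushedSequenceId) {
    this.lastFlushedSequenceId = initialFlushedSequenceId;
  }

  /** Flusher side: a flush-all edit advances the flushed id and clears failures. */
  void onFlushAll(long flushSequenceId) {
    synchronized (lock) {
      lastFlushedSequenceId = flushSequenceId;
      failedReplicas.clear();
    }
  }

  /** Event-loop side: returns true only if the failures are still relevant. */
  boolean recordFailures(Set<Integer> failed, long maxSequenceId) {
    synchronized (lock) {
      // Double check: a flush may have happened since the unsynchronized check.
      if (failed.isEmpty() || maxSequenceId <= lastFlushedSequenceId) {
        return false;
      }
      failedReplicas.addAll(failed);
      return true;
    }
  }

  Set<Integer> snapshot() {
    synchronized (lock) {
      return new HashSet<>(failedReplicas);
    }
  }

  /** Replays the racy interleaving against the double-checked version. */
  static Set<Integer> replay() {
    DoubleCheckSketch sink = new DoubleCheckSketch(10);
    Set<Integer> failed = new HashSet<>();
    failed.add(1);                    // first check passed: 15 > 10
    sink.onFlushAll(20);              // flusher runs before the lock is taken
    sink.recordFailures(failed, 15);  // second check fails: 15 <= 20
    return sink.snapshot();
  }

  public static void main(String[] args) {
    System.out.println("failedReplicas after replay: " + replay());
  }
}
```

In this model the stale failure is dropped, so the replica is not marked failed and no flush needs to be requested for it.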
Issue Links
- relates to
  - HBASE-26449 The way we add or clear failedReplicas may have race (Resolved)
  - HBASE-26233 The region replication framework should not be built upon the general replication framework (Resolved)