[HBASE-18192] Replication drops recovered queues on region server shutdown - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.3.1, 1.2.6
Fix Version/s: 1.4.0, 1.3.2, 2.0.0-alpha-2, 2.0.0, 1.2.7
Component/s: Replication
Labels: None
Hadoop Flags: Reviewed
Release Note:

If a region server that is processing a recovered queue for another, previously dead region server is gracefully shut down, it can drop the recovered queue under certain conditions. Running without this fix on a 1.2+ release means a continuing possibility of data loss in replication, irrespective of which WALProvider is used.
If a single WAL group (or DefaultWALProvider) is used, running without this fix will always cause data loss in replication whenever a region server processing recovered queues is gracefully shut down.


Description

When a recovered queue has only one active ReplicationWorkerThread, the recovered queue is completely dropped on a region server shutdown. This can happen when:
1. DefaultWALProvider is used.
2. RegionGroupingProvider is used but replication is stuck on one WAL group for some reason (for example, HBASE-18137).
3. All other replication workers have died due to an unhandled exception, and the only remaining one finishes. In this case the recovered queue is deleted even without a region server shutdown. This can happen on deployments without the fix for HBASE-17381.

The problematic piece of code is:

while (isWorkerActive()) {
  // The worker thread run loop...
}
// Reached whenever the loop exits, including when the region server is shutting down
if (replicationQueueInfo.isQueueRecovered()) {
  // use synchronize to make sure one last thread will clean the queue
  synchronized (workerThreads) {
    Threads.sleep(100); // wait a short while for other worker threads to fully exit
    boolean allOtherTaskDone = true;
    for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
      if (!worker.equals(this) && worker.isAlive()) {
        allOtherTaskDone = false;
        break;
      }
    }
    if (allOtherTaskDone) {
      manager.closeRecoveredQueue(this.source);
      LOG.info("Finished recovering queue " + peerClusterZnode
          + " with the following stats: " + getStats());
    }
  }
}

The conceptual issue is that isWorkerActive() tells whether a worker is currently running, yet it is being used as a proxy for whether the worker has finished its work. In fact, "Should a worker exit?" and "Has a worker completed its work?" are two different questions.
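
To make the distinction concrete, here is a minimal, self-contained sketch that models the two questions as separate flags. It is not the HBase code and not the attached patch: WorkerState, RecoveredQueueSketch, shouldCloseByLiveness and shouldCloseByCompletion are hypothetical names invented for illustration. The first check mirrors the liveness-only decision in the snippet above. The second shows one possible alternative, which closes the queue only when every worker reports that it has drained its work.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of a replication worker; not the HBase class. */
final class WorkerState {
    final boolean alive;        // "Is the thread still running?"
    final boolean queueDrained; // "Has it shipped everything it was responsible for?"

    WorkerState(boolean alive, boolean queueDrained) {
        this.alive = alive;
        this.queueDrained = queueDrained;
    }
}

/** Hypothetical helper comparing the two cleanup decisions. */
final class RecoveredQueueSketch {

    /** Mirrors the check above: only the liveness of the other workers is consulted. */
    static boolean shouldCloseByLiveness(WorkerState self, List<WorkerState> all) {
        for (WorkerState w : all) {
            if (w != self && w.alive) {
                return false;
            }
        }
        return true;
    }

    /** Assumed alternative: close only when every worker, including self, has drained. */
    static boolean shouldCloseByCompletion(List<WorkerState> all) {
        for (WorkerState w : all) {
            if (!w.queueDrained) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A single worker that was stopped by a shutdown before draining its queue.
        WorkerState interrupted = new WorkerState(false, false);
        List<WorkerState> workers = Arrays.asList(interrupted);

        // prints "liveness check closes queue: true"
        System.out.println("liveness check closes queue: "
            + shouldCloseByLiveness(interrupted, workers));
        // prints "completion check closes queue: false"
        System.out.println("completion check closes queue: "
            + shouldCloseByCompletion(workers));
    }
}
```

With a single worker, the liveness check has no other worker to inspect, so it is trivially satisfied and the queue is closed even though the work is unfinished. This corresponds to the DefaultWALProvider case described above.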

Attachments

HBASE-18192.branch-1.001.patch, 10/Jun/17 00:07, 13 kB, Ashu Pachauri
HBASE-18192.branch-1.3.003.patch, 10/Jun/17 00:07, 14 kB, Ashu Pachauri
HBASE-18192.master.001.patch, 10/Jun/17 00:07, 16 kB, Ashu Pachauri

Issue Links

links to

Review Board (branch-1.3)

People

Assignee: Ashu Pachauri
Reporter: Ashu Pachauri
Votes: 0
Watchers: 8

Dates

Created: 08/Jun/17 06:16
Updated: 01/Aug/18 06:20
Resolved: 10/Jun/17 02:59