[HBASE-18137] Replication gets stuck for empty WALs - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.3.1
Fix Version/s: 1.4.0, 1.3.2, 2.0.0, 1.2.7
Component/s: Replication
Labels: None
Hadoop Flags: Reviewed
Release Note:

0-length WAL files can potentially cause the replication queue to get stuck. A new config "replication.source.eof.autorecovery" has been added: if set to true (default is false), the 0-length WAL file will be skipped after 1) the max number of retries has been hit, and 2) there are more WAL files in the queue. The risk of enabling this is that there is a chance the 0-length WAL file actually has some data (e.g. block went missing and will come back once a datanode is recovered).
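The skip conditions in this note can be sketched in a few lines of Java. This is only an illustration of the rule as described, not the code from the attached patches: the class name, method name, and parameters are hypothetical, and the real check lives inside the replication source.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Illustrative sketch only; not the HBASE-18137 patch.
 * All names in this class are hypothetical.
 */
final class EmptyWalSkipSketch {

    /** Returns true when the WAL at the head of the queue may be skipped. */
    static boolean shouldSkipHead(boolean eofAutoRecovery, long headLength,
                                  int retries, int maxRetries, int queueSize) {
        return eofAutoRecovery        // replication.source.eof.autorecovery is true
            && headLength == 0        // the file reports a length of zero
            && retries >= maxRetries  // 1) the max number of retries has been hit
            && queueSize > 1;         // 2) more WAL files are queued behind it
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>();
        queue.add("wal.1"); // assumed to be the 0-length file
        queue.add("wal.2");
        if (shouldSkipHead(true, 0L, 3, 3, queue.size())) {
            queue.remove(); // drop the empty WAL and move on
        }
        System.out.println("next WAL: " + queue.peek());
    }
}
```

With the flag left at its default of false, the same call returns false and the empty file stays at the head of the queue, which matches the stuck behavior reported in this issue.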


Description

Replication assumes that only the last WAL of a recovered queue can be empty. However, intermittent DFS issues may cause empty WALs to be created (without the PWAL magic) and a WAL roll to happen without a regionserver crash. Recovered queues can then contain empty WALs in the middle, which causes replication to get stuck:

TRACE regionserver.ReplicationSource: Opening log <wal_file>
WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got: 
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at java.io.DataInputStream.readFully(DataInputStream.java:169)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)

The WAL in question was completely empty but there were other WALs in the recovered queue which were newer and non-empty.
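The EOFException at the top of the trace comes from DataInputStream.readFully, which throws when the stream ends before the requested bytes are read. A minimal sketch of that behavior follows, using an in-memory byte array as a stand-in for the WAL file. The class and method names are hypothetical, and the 4-byte header length is an assumption chosen to match the length of the "PWAL" magic; the actual reader in the trace is SequenceFile's.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; names are hypothetical. */
final class EmptyHeaderReadSketch {

    /** Returns true if reading headerLength bytes hits end-of-file first. */
    static boolean headerReadHitsEof(byte[] fileContents, int headerLength) {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(fileContents));
        try {
            in.readFully(new byte[headerLength]); // same call as in the trace
            return false;
        } catch (EOFException e) {
            return true; // a 0-length file ends before any header bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] withMagic = "PWAL".getBytes(StandardCharsets.US_ASCII);
        System.out.println("empty file hits EOF: " + headerReadHitsEof(empty, 4));
        System.out.println("file with magic hits EOF: " + headerReadHitsEof(withMagic, 4));
    }
}
```

Because the exception is raised while opening the reader, a retry against the same empty file fails the same way, so the queue cannot advance to the newer WALs behind it.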

Attachments

HBASE-18137.master.v1.patch, 10/Jun/17 01:47, 10 kB, Vincent Poon
HBASE-18137.branch-1.v2.patch, 10/Jun/17 01:38, 10 kB, Vincent Poon
HBASE-18137.branch-1.v1.patch, 08/Jun/17 23:28, 10 kB, Vincent Poon
HBASE-18137.branch-1.3.v3.patch, 08/Jun/17 23:12, 11 kB, Vincent Poon
HBASE-18137.branch-1.3.v2.patch, 08/Jun/17 19:00, 9 kB, Vincent Poon
HBASE-18137.branch-1.3.v1.patch, 07/Jun/17 22:13, 8 kB, Vincent Poon

Issue Links

is duplicated by: HBASE-12830 Unreadable HLogs stuck in replication queues (Closed)

relates to: HBASE-12125 Add Hbck option to check and fix WAL's from replication queue (Closed)

People

Assignee: Vincent Poon

Reporter: Ashu Pachauri

Votes: 0

Watchers: 17

Dates

Created: 31/May/17 01:30

Updated: 11/Jan/21 21:56

Resolved: 10/Jun/17 19:58