[HBASE-25596] Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
Component/s: Replication
Labels: None
Hadoop Flags: Reviewed

Description

There seems to be a major issue with how we handle the EOF exception from WALEntryStream.

Problem:

When we see an EOFException, we handle it by removing the affected WAL from the log queue, but we never ship the batch of entries that was already read. Those entries are never replicated, which is permanent data loss in replication.

Second, we do not stop the reader when it encounters the EOFException. If the EOFException occurs on the last WAL, we still try to process the WAL entry stream and ship an empty batch with lastWALPath set to null. This causes the NPE below, which crashes the region server.

2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] regionserver.ReplicationSource - Unexpected exception in ReplicationSourceWorkerThread, currentPath=null
java.lang.NullPointerException
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)
2021-02-16 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: Unexpected exception in ReplicationSourceWorkerThread
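
To make the two failure modes concrete, here is a minimal, self-contained sketch of the handling the description seems to call for. It is not the actual patch and uses no HBase classes: Wal, Batch, readInto and drain are hypothetical stand-ins, and only the idea (ship what was read before dropping the WAL, never ship an empty batch whose lastWalPath is null) comes from the description above.

```java
import java.io.EOFException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative only: these types are hypothetical stand-ins, not HBase classes.
class EofHandlingSketch {

    /** A WAL file in the queue; truncated means reading it ends in EOFException. */
    static final class Wal {
        final String path;
        final boolean truncated;
        final List<String> entries;

        Wal(String path, boolean truncated, String... entries) {
            this.path = path;
            this.truncated = truncated;
            this.entries = Arrays.asList(entries);
        }
    }

    /** Entries read so far, plus the path of the WAL they came from. */
    static final class Batch {
        final List<String> entries = new ArrayList<>();
        String lastWalPath; // stays null while the batch is empty
    }

    /** Reads every readable entry, then fails with EOFException if the WAL is truncated. */
    static void readInto(Wal wal, Batch batch) throws EOFException {
        for (String entry : wal.entries) {
            batch.entries.add(entry);
            batch.lastWalPath = wal.path;
        }
        if (wal.truncated) {
            throw new EOFException("unexpected end of " + wal.path);
        }
    }

    /** Drains the queue and returns the batches that would be handed to the shipper. */
    static List<Batch> drain(Deque<Wal> queue) {
        List<Batch> shipped = new ArrayList<>();
        while (!queue.isEmpty()) {
            Wal wal = queue.peek();
            Batch batch = new Batch();
            try {
                readInto(wal, batch);
            } catch (EOFException e) {
                // Fall through: entries read before the EOF are still in the batch.
            }
            // Ship what was read before dropping the WAL, and never ship an
            // empty batch (its lastWalPath would be null).
            if (!batch.entries.isEmpty()) {
                shipped.add(batch);
            }
            queue.remove();
        }
        // The loop ends once the queue is empty, so nothing is read or shipped
        // after an EOF on the last WAL.
        return shipped;
    }
}
```

In this sketch, a truncated WAL that still yielded entries produces a shipped batch, while a truncated and empty last WAL produces none, so the shipper is never handed a null path.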

Issue Links

links to

GitHub Pull Request #2975

GitHub Pull Request #2987

GitHub Pull Request #2990

GitHub Pull Request #3008

Sub-Tasks

1.	Backport HBASE-25596 and HBASE-25992 to branch-2.3	Open	Anoop Sam John
2.	ReplicationSourceWALReader#run - Reset sleepMultiplier in loop once out of any IOE	Resolved	Unassigned
3.	Polish the ReplicationSourceWALReader code for 2.x after HBASE-25596	Resolved	Duo Zhang

People

Assignee: Sandeep Pal

Reporter: Sandeep Pal

Votes: 0

Watchers: 10

Dates

Created: 22/Feb/21 21:48

Updated: 02/Jul/21 03:10

Resolved: 03/Mar/21 19:41