[HBASE-27231] FSHLog should retry writing WAL entries when syncs to HDFS failed. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha-4
Fix Version/s: 3.0.0-beta-1
Component/s: wal
Labels: None
Hadoop Flags: Reviewed

Description

As noted in ~~HBASE-27223~~, if a WAL write to HDFS fails, we cannot know whether the data has been persisted. AsyncFSWAL handles this by opening a new writer and writing the WAL entries again, with additional logic in WAL split and replay to deal with the resulting duplicate entries. FSHLog does not have this logic: when ProtobufLogWriter.append or ProtobufLogWriter.sync fails, FSHLog.sync immediately rethrows that exception. We should implement the same retry logic in FSHLog as in AsyncFSWAL, so that WAL.sync can only throw TimeoutIOException and we can uniformly abort the RegionServer when WAL.sync fails.
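To make the retry idea concrete, here is a minimal sketch, not the actual HBase code. `Writer`, `InMemoryWriter`, and `RetryingWal` are hypothetical names invented for illustration, and the sketch gives up with a plain `IOException` rather than HBase's `TimeoutIOException`. It keeps unacknowledged entries in memory, and when a sync fails it rolls to a new writer and appends them again, which is why the old and new files can both contain the same entries.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

public class WalRetrySketch {

  /** Hypothetical stand-in for a WAL writer; this is not the HBase writer API. */
  interface Writer {
    void append(String entry) throws IOException;
    void sync() throws IOException;
  }

  /** In-memory writer for the demo; can be told to fail on every sync. */
  static class InMemoryWriter implements Writer {
    final List<String> entries = new ArrayList<>();
    private final boolean failOnSync;

    InMemoryWriter(boolean failOnSync) {
      this.failOnSync = failOnSync;
    }

    @Override
    public void append(String entry) {
      entries.add(entry);
    }

    @Override
    public void sync() throws IOException {
      // After a failed sync the caller cannot tell whether the entries were persisted.
      if (failOnSync) {
        throw new IOException("simulated sync failure");
      }
    }
  }

  /** Buffers entries until a sync succeeds; on failure, rolls the writer and rewrites them. */
  static class RetryingWal {
    private final Supplier<Writer> writerFactory;
    private final int maxAttempts;
    private final List<String> unacked = new ArrayList<>();
    private Writer writer;

    RetryingWal(Supplier<Writer> writerFactory, int maxAttempts) {
      this.writerFactory = writerFactory;
      this.maxAttempts = maxAttempts;
      this.writer = writerFactory.get();
    }

    void append(String entry) {
      unacked.add(entry);
    }

    int unackedCount() {
      return unacked.size();
    }

    void sync() throws IOException {
      IOException last = null;
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          for (String entry : unacked) {
            writer.append(entry);
          }
          writer.sync();
          unacked.clear(); // acknowledged only after a successful sync
          return;
        } catch (IOException e) {
          last = e;
          if (attempt < maxAttempts) {
            writer = writerFactory.get(); // roll to a fresh writer and retry
          }
        }
      }
      // Single failure mode for callers once retries are exhausted.
      throw new IOException("sync failed after " + maxAttempts + " attempts", last);
    }
  }

  public static void main(String[] args) throws IOException {
    InMemoryWriter first = new InMemoryWriter(true);
    InMemoryWriter second = new InMemoryWriter(false);
    Iterator<Writer> writers = List.<Writer>of(first, second).iterator();
    RetryingWal wal = new RetryingWal(writers::next, 3);
    wal.append("put-1");
    wal.append("put-2");
    wal.sync();
    System.out.println("first writer saw:  " + first.entries);
    System.out.println("second writer saw: " + second.entries);
  }
}
```

Both writers end up holding the same entries, so whatever reads the files back must tolerate duplicates, which is the role the WAL split and replay logic plays in the description above.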

The basic idea: because both FSHLog.RingBufferEventHandler and AsyncFSWAL.consumeExecutor are single-threaded, we can reuse the logic in AsyncFSWAL by moving most of its code up into AbstractFSWAL and adapting the SyncRunner in FSHLog to the logic of AsyncWriter.sync. Once that is done, most of the logic in AsyncFSWAL and FSHLog is unified; only how the writer is synced differs.
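The shape of that refactoring can be sketched as a template method. This is an illustration under assumptions, not the structure of the real classes: `AbstractWalSketch`, `BlockingWalSketch`, `AsyncWalSketch`, and `doSync` are hypothetical names. The shared consume logic lives in the base class, and each subclass only decides how a batch is synced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class UnifiedWalSketch {

  /** Hypothetical base class holding the logic shared by both WAL flavours. */
  abstract static class AbstractWalSketch {
    private final List<String> pending = new ArrayList<>();
    final List<String> synced = new ArrayList<>();

    final void append(String entry) {
      pending.add(entry);
    }

    /** Shared consume step, assumed to run on a single thread in both flavours. */
    final void consume() {
      List<String> batch = new ArrayList<>(pending);
      pending.clear();
      doSync(batch).join(); // wait for the flavour-specific sync to finish
      synced.addAll(batch);
    }

    /** The only point where the two flavours differ. */
    protected abstract CompletableFuture<Void> doSync(List<String> batch);
  }

  /** Syncs on the calling thread and returns an already-completed future. */
  static class BlockingWalSketch extends AbstractWalSketch {
    int syncCalls = 0;

    @Override
    protected CompletableFuture<Void> doSync(List<String> batch) {
      syncCalls++; // a blocking writer sync would happen here
      return CompletableFuture.completedFuture(null);
    }
  }

  /** Hands the sync to another thread and returns a future for its completion. */
  static class AsyncWalSketch extends AbstractWalSketch {
    int syncCalls = 0;

    @Override
    protected CompletableFuture<Void> doSync(List<String> batch) {
      syncCalls++;
      return CompletableFuture.runAsync(() -> {
        // an asynchronous writer sync would happen here
      });
    }
  }

  public static void main(String[] args) {
    AbstractWalSketch[] wals = {new BlockingWalSketch(), new AsyncWalSketch()};
    for (AbstractWalSketch wal : wals) {
      wal.append("put-1");
      wal.append("put-2");
      wal.consume();
      System.out.println(wal.getClass().getSimpleName() + " synced " + wal.synced);
    }
  }
}
```

With this layout, retry and bookkeeping code written once in the base class applies to both flavours, which matches the intent of moving the shared logic upward.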

Issue Links

blocks

HBASE-27970 Make sync replication also work with FSHLog

Resolved

links to

GitHub Pull Request #4721

GitHub Pull Request #5317

People

Assignee: chenglei

Reporter: chenglei

Votes: 0

Watchers: 6

Dates

Created: 21/Jul/22 03:32

Updated: 12/Sep/23 23:42

Resolved: 13/Jul/23 13:34