[HBASE-22539] WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.2.0, 2.0.5, 2.1.5
Fix Version/s: 3.0.0-alpha-1, 2.3.0, 2.0.6, 2.2.1, 2.1.6
Component/s: rpc, wal
Labels: None
Hadoop Flags: Reviewed
Release Note:

We found a critical bug that can lead to WAL corruption when Durability.ASYNC_WAL is used. The cause is that a ByteBuffer is released before its content has actually been persisted to the WAL file.

The problem may lead to several errors, for example an ArrayIndexOutOfBoundsException when replaying the WAL. This happens because the ByteBuffer has been reused by others.

ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.ArrayIndexOutOfBoundsException: 18056
        at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
        at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
        at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
        at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
        at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
        at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)

It may even cause a segmentation fault and crash the JVM directly. You will see an hs_err_pidXXX.log file, and the reported problem is usually SIGSEGV. This is usually because the ByteBuffer has already been returned to the OS and used for another purpose.

The problem has been reported several times in the past. This time Wellington Ramos Chevreuil provided the full logs and analyzed them in depth, so we could find the root cause, and Lijin Bin figured out that the problem may only happen when Durability.ASYNC_WAL is used. Thanks to them.

The problem only affects the 2.x releases. All users are strongly recommended to upgrade to a release that includes this fix, especially if you use Durability.ASYNC_WAL.
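
The early-release hazard described in this release note can be illustrated with a small, self-contained sketch. It is not HBase code: `Pool`, `AsyncWal`, `fill`, and `run` are hypothetical names invented for the illustration, and a single pooled direct buffer stands in for the RPC layer's buffers. The "WAL" only keeps a reference to the buffer and persists it later, so if the buffer is handed to the next request before the sync happens, the persisted bytes belong to the wrong request.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class EarlyReleaseSketch {
    /** Hypothetical one-buffer pool standing in for a pool of reusable direct buffers. */
    static final class Pool {
        private final ByteBuffer buf = ByteBuffer.allocateDirect(16);
        ByteBuffer acquire() { buf.clear(); return buf; }
    }

    /** Hypothetical async WAL: append() only queues a reference, sync() persists later. */
    static final class AsyncWal {
        private final Queue<ByteBuffer> pending = new ArrayDeque<>();
        final List<String> persisted = new ArrayList<>();

        // duplicate() shares the underlying memory; no bytes are copied here.
        void append(ByteBuffer b) { pending.add(b.duplicate()); }

        void sync() {
            ByteBuffer b;
            while ((b = pending.poll()) != null) {
                persisted.add(StandardCharsets.UTF_8.decode(b).toString());
            }
        }
    }

    static ByteBuffer fill(ByteBuffer b, String s) {
        b.put(s.getBytes(StandardCharsets.UTF_8));
        b.flip();
        return b;
    }

    /** Returns what ends up persisted for the first request. */
    static String run(boolean syncBeforeReuse) {
        Pool pool = new Pool();
        AsyncWal wal = new AsyncWal();
        wal.append(fill(pool.acquire(), "row-A"));
        if (syncBeforeReuse) {
            wal.sync(); // content is persisted before the buffer is given away
        }
        // The first request has returned; the next request reuses the same buffer.
        fill(pool.acquire(), "row-B");
        wal.sync();
        return wal.persisted.get(0);
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // buffer reused before persist
        System.out.println(run(true));  // buffer reused only after persist
    }
}
```

In this sketch the safe variant simply persists before reuse. Keeping the buffer retained until the asynchronous write completes, or copying the bytes at append time, would be other ways to get the same guarantee; which approach the actual fix takes is not stated in this release note.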

Description

Summary

We had been chasing a WAL corruption issue reported on one of our customers' deployments running release 2.1.1 (CDH 6.1.0). After providing a custom-modified jar with the extra sanity checks implemented by ~~HBASE-21401~~ applied at some code points, plus additional debugging messages, we believe it is related to DirectByteBuffer usage and to the Unsafe copy from off-heap memory to an on-heap array triggered here, such as when writing into a non-ByteBufferWriter type, as done here.

More details are in the following comment.

Attachments

- HBASE-22539.branch-2.001.patch (15 kB), 31/Jul/19 13:21, Wellington Chevreuil
- HBASE-22539-UT.patch (10 kB), 30/Jul/19 14:37, Duo Zhang

Issue Links

is duplicated by

- HBASE-22761 Caught ArrayIndexOutOfBoundsException while processing event RS_LOG_REPLAY (Resolved)

relates to

- HBASE-24984 WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used with multi operation (Resolved)
- HBASE-23157 WAL unflushed seqId tracking may wrong when Durability.ASYNC_WAL is used (Resolved)
- HBASE-25701 RegionServer JVM crash when append wal entry (Open)

links to

- GitHub Pull Request #437
