HBase / HBASE-22539

WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used

Details

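    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.2.0, 2.0.5, 2.1.5
    • Fix Version/s: 3.0.0-alpha-1, 2.3.0, 2.0.6, 2.2.1, 2.1.6
    • Component/s: rpc, wal
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note: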
      We found a critical bug which can lead to WAL corruption when Durability.ASYNC_WAL is used. The reason is that we release a ByteBuffer before its content has actually been persisted to the WAL file.

      The problem may lead to several errors, for example an ArrayIndexOutOfBoundsException when replaying the WAL. This happens because the ByteBuffer has been reused by others.

      ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
      java.lang.ArrayIndexOutOfBoundsException: 18056
              at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
              at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
              at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
              at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
              at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
              at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
              at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
              at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
              at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)

      It may even cause a segmentation fault and crash the JVM directly. You will see a hs_err_pidXXX.log file, and usually the problem is SIGSEGV. This is usually because the ByteBuffer has already been returned to the OS and used for another purpose.

      The problem has been reported several times in the past. This time Wellington Ramos Chevreuil provided the full logs and analyzed them in depth, so we could find the root cause. Lijin Bin figured out that the problem may only happen when Durability.ASYNC_WAL is used. Thanks to them.

      The problem only affects the 2.x releases. All users are strongly recommended to upgrade to a release that includes this fix, especially if you use Durability.ASYNC_WAL.
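
      To make the failure mode concrete, here is a minimal, single-threaded sketch of the early-release pattern described above. It is not HBase code: the Pool class, the simulate method, and the "edit-one"/"edit-two" payloads are hypothetical names chosen for illustration, and the asynchronous WAL writer is modelled simply as a reference that is read later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

final class EarlyReleaseSketch {

    /** Hypothetical one-slot pool standing in for a reusable buffer reservoir. */
    static final class Pool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();

        Pool() {
            free.push(ByteBuffer.allocateDirect(16));
        }

        ByteBuffer take() {
            ByteBuffer b = free.pop();
            b.clear(); // reset position/limit; the old bytes are still in memory
            return b;
        }

        void release(ByteBuffer b) {
            free.push(b);
        }
    }

    /** Reads the first len bytes without disturbing the buffer's own position. */
    static String read(ByteBuffer b, int len) {
        byte[] out = new byte[len];
        ByteBuffer view = b.duplicate();
        view.position(0);
        view.get(out);
        return new String(out, StandardCharsets.US_ASCII);
    }

    /** Returns what the (simulated) WAL writer ends up persisting for the first edit. */
    static String simulate(boolean releaseBeforePersist) {
        Pool pool = new Pool();
        ByteBuffer request = pool.take();
        request.put("edit-one".getBytes(StandardCharsets.US_ASCII));

        // The async append only keeps a reference to the buffer; nothing is copied.
        ByteBuffer queuedEdit = request;

        if (releaseBeforePersist) {
            pool.release(request); // released before the writer has run
            ByteBuffer next = pool.take(); // a later request gets the same memory
            next.put("edit-two".getBytes(StandardCharsets.US_ASCII));
            return read(queuedEdit, 8); // the writer now sees the wrong bytes
        }
        String persisted = read(queuedEdit, 8); // persist first
        pool.release(request); // then release
        return persisted;
    }

    public static void main(String[] args) {
        System.out.println("release before persist: " + simulate(true));
        System.out.println("release after persist:  " + simulate(false));
    }
}
```

      In this sketch, releasing first makes the writer persist "edit-two" in place of "edit-one", while releasing after the write keeps "edit-one" intact. With real cell data the overwritten bytes are interpreted as lengths and offsets, which is one way an out-of-range index like the one in the stack trace above could arise.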

    Description

      Summary

      We had been chasing a WAL corruption issue reported on one of our customers' deployments running release 2.1.1 (CDH 6.1.0). After providing a custom modified jar with the extra sanity checks implemented by HBASE-21401 applied at some code points, plus additional debugging messages, we believe it is related to DirectByteBuffer usage and to the Unsafe copy from off-heap memory to an on-heap array triggered here, such as when writing into a non-ByteBufferWriter type, as done here.

      More details in the following comment.
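
      As a rough illustration of the copy path mentioned in the summary, the sketch below copies a range of a direct ByteBuffer into an on-heap array before writing it to a plain OutputStream, validating the range first. It is an assumption-laden stand-in, not the HBase implementation and not the HBASE-21401 checks themselves: CheckedBufferCopy and copyToStream are hypothetical names, and the copy uses the public ByteBuffer API in place of Unsafe.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

final class CheckedBufferCopy {

    /**
     * Copies length bytes starting at offset from src to out.
     * The stream is not buffer-aware, so the bytes go through an on-heap array.
     */
    static void copyToStream(OutputStream out, ByteBuffer src, int offset, int length)
            throws IOException {
        // Sanity check: reject ranges that fall outside the buffer, so that a
        // bad offset/length fails loudly here and never reaches the copy.
        if (offset < 0 || length < 0 || offset > src.capacity() - length) {
            throw new IllegalArgumentException(
                "range [" + offset + ", +" + length + ") outside capacity " + src.capacity());
        }
        byte[] onHeap = new byte[length];
        ByteBuffer view = src.duplicate(); // independent position, same memory
        view.clear();                      // position 0, limit = capacity
        view.position(offset);
        view.get(onHeap, 0, length);       // off-heap to on-heap copy
        out.write(onHeap);
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer src = ByteBuffer.allocateDirect(8);
        for (int i = 1; i <= 8; i++) {
            src.put((byte) i);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyToStream(out, src, 2, 3);
        System.out.println(Arrays.toString(out.toByteArray()));
    }
}
```

      A check like this only catches ranges that are inconsistent with the buffer itself. It cannot detect that the buffer's memory was handed to another request in the meantime, which is why the ordering of release and persist still matters.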

       

      Attachments

        1. HBASE-22539-UT.patch
          10 kB
          Duo Zhang
        2. HBASE-22539.branch-2.001.patch
          15 kB
          Wellington Chevreuil

            People

              Assignee: Duo Zhang (zhangduo)
              Reporter: Wellington Chevreuil (wchevreuil)
              Votes: 0
              Watchers: 20
