[HBASE-20723] Custom hbase.wal.dir results in data loss because we write recovered edits into a different place than where the recovering region server looks for them - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.4.0, 1.4.1, 1.4.2, 1.4.3, 1.4.4, 2.0.0
Fix Version/s: 3.0.0-alpha-1, 2.1.0, 1.3.3, 1.4.6
Component/s: Recovery, wal
Labels:
- s3

Hadoop Flags:

Reviewed
Release Note:

Previously, a custom hbase.wal.dir that was not on the same filesystem as the HBase root directory resulted in data loss, because recovered edits were written to a different location than the one where the recovering region server looked for them. This is fixed by retrieving the actual recovered edits from the correct location.

Description

Description:

When a custom hbase.wal.dir is configured, the recovery system uses it in place of the HBase root dir and therefore constructs an incorrect path for recovered edits when splitting WALs. As a result, the recovery code in the Region Servers believes there are no recovered edits to replay, and any writes that had not been flushed before the server was lost are dropped.

Reproduction:

This is an Azure HDInsight HBase cluster with HDP 2.6 and HBase 1.1.2.2.6.3.2-14.

By default, the underlying data is stored at wasb://xxxxx@yyyyy/hbase.
I tried to move the WAL folders to HDFS, which is backed by the SSD mounted on each VM at /mnt.

hbase.wal.dir=hdfs://mycluster/walontest
hbase.wal.dir.perms=700
hbase.rootdir.perms=700
hbase.rootdir=wasb://XYZ@hbaseperf.core.net/hbase
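
As a quick illustration of why this configuration is affected, the following minimal sketch compares the scheme and authority of the two directories to decide whether they are likely on the same filesystem. `WalDirCheck` and `sameFileSystem` are hypothetical names introduced here for illustration; they are not part of HBase, and the comparison is a simplification of how a real filesystem client resolves URIs.

```java
import java.net.URI;
import java.util.Objects;

/** Hypothetical helper for illustration only; not an HBase class. */
final class WalDirCheck {

    /**
     * Returns true when both directory URIs have the same scheme (ignoring case)
     * and the same authority. This is only an approximation of "same filesystem".
     */
    static boolean sameFileSystem(String rootDir, String walDir) {
        URI root = URI.create(rootDir.trim());
        URI wal = URI.create(walDir.trim());
        return sameScheme(root.getScheme(), wal.getScheme())
            && Objects.equals(root.getAuthority(), wal.getAuthority());
    }

    // Schemes are case-insensitive; either one may be null for scheme-less paths.
    private static boolean sameScheme(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        // Values taken from the configuration in this report.
        String rootDir = "wasb://XYZ@hbaseperf.core.net/hbase";
        String walDir = "hdfs://mycluster/walontest";
        System.out.println("same filesystem: " + sameFileSystem(rootDir, walDir));
    }
}
```

With the values from this report the two directories differ in both scheme and authority, which is the situation in which the data loss was observed.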

Procedure to reproduce this issue:

1. create a table in hbase shell

2. insert a row in hbase shell

3. reboot the VM which hosts that region

4. scan the table in hbase shell and it is empty

Looking at the region server logs:

2018-06-12 22:08:40,455 INFO  [RS_LOG_REPLAY_OPS-wn2-duohba:16020-0-Writer-1] wal.WALSplitter: This region's directory doesn't exist: hdfs://mycluster/walontest/data/default/tb1/b7fd7db5694eb71190955292b3ff7648. It is very likely that it was already split so it's safe to discard those edits.

The log split/replay ignored the actual WAL edits because WALSplitter looks for the region directory under the hbase.wal.dir we specified rather than under hbase.rootdir.

Looking at the source code,
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WALSplitter.java, WALSplitter uses its rootDir, which is actually walDir, as the root path for tableDir.

So if we use HBASE-17437 and the WAL dir and the HBase root dir are on different paths, or even on different filesystems, then the #5 use of walDir as tableDir is clearly wrong.
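
The mismatch can be sketched with a simplified model of the region directory lookup. This is not the WALSplitter code: `RegionDirModel` and `regionDir` are hypothetical names, and the `<base>/data/<namespace>/<table>/<region>` layout is inferred from the log line above. The sketch only shows that resolving the region under the WAL dir yields the non-existent path from the log, while the region actually lives under the root dir.

```java
/** Simplified, hypothetical model of the lookup; it does not mirror the WALSplitter API. */
final class RegionDirModel {

    /** Builds <base>/data/<namespace>/<table>/<encodedRegion>, tolerating a trailing slash on base. */
    static String regionDir(String baseDir, String namespace, String table, String encodedRegion) {
        String base = baseDir.endsWith("/")
            ? baseDir.substring(0, baseDir.length() - 1)
            : baseDir;
        return base + "/data/" + namespace + "/" + table + "/" + encodedRegion;
    }

    public static void main(String[] args) {
        // Values taken from the configuration and log line in this report.
        String rootDir = "wasb://XYZ@hbaseperf.core.net/hbase";
        String walDir = "hdfs://mycluster/walontest";
        String region = "b7fd7db5694eb71190955292b3ff7648";

        // Buggy lookup: the region directory is resolved under the WAL dir,
        // so it does not exist and the edits are discarded.
        System.out.println("checked:  " + regionDir(walDir, "default", "tb1", region));

        // Expected lookup: the region directory is resolved under the root dir.
        System.out.println("expected: " + regionDir(rootDir, "default", "tb1", region));
    }
}
```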

CC: zyork, yuzhihong@gmail.com. I have attached the logs for quick review.

Attachments

- 20723.branch-2.txt (16/Jun/18 02:49, 12 kB, Ted Yu)
- 20723.branch-1.txt (15/Jun/18 23:47, 27 kB, Ted Yu)
- 20723.v10.txt (15/Jun/18 22:20, 12 kB, Ted Yu)
- 20723.v9.txt (15/Jun/18 00:27, 8 kB, Ted Yu)
- 20723.v8.txt (14/Jun/18 22:30, 7 kB, Ted Yu)
- 20723.v7.txt (14/Jun/18 21:50, 5 kB, Ted Yu)
- 20723.v6.txt (14/Jun/18 19:42, 4 kB, Ted Yu)
- 20723.v5.txt (14/Jun/18 15:42, 4 kB, Ted Yu)
- 20723.v5.txt (14/Jun/18 09:49, 4 kB, Ted Yu)
- 20723.v4.txt (14/Jun/18 09:28, 3 kB, Ted Yu)
- 20723.v3.txt (14/Jun/18 05:03, 3 kB, Ted Yu)
- 20723.v2.txt (14/Jun/18 03:20, 2 kB, Ted Yu)
- 20723.v1.txt (13/Jun/18 23:36, 1 kB, Ted Yu)
- logs.zip (13/Jun/18 01:30, 511 kB, Rohan Pednekar)

Issue Links

is related to: HBASE-20734 Colocate recovered edits directory with hbase.wal.dir (Closed)

People

Assignee: Ted Yu

Reporter: Rohan Pednekar

Votes: 0

Watchers: 17

Dates

Created: 13/Jun/18 01:47

Updated: 16/Dec/24 07:15

Resolved: 17/Jun/18 13:21