[HBASE-23693] Split failure may cause region hole and data loss when use zk assign - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.4.8
Fix Version/s: 1.6.0
Component/s: master
Labels: None
Tags: split

Description

To simulate this case, I added a sleep to SplitTransactionImpl.execute, after the PONR (point of no return) and before openDaughters:

public PairOfSameType<Region> execute(final Server server,
      final RegionServerServices services, User user) throws IOException {
    this.server = server;
    this.rsServices = services;
    useZKForAssignment = server == null ? true :
      ConfigUtil.useZKForAssignment(server.getConfiguration());
    if (useCoordinatedStateManager(server)) {
      std =
          ((BaseCoordinatedStateManager) server.getCoordinatedStateManager())
              .getSplitTransactionCoordination().getDefaultDetails();
    }
    PairOfSameType<Region> regions = createDaughters(server, services, user);
    if (this.parent.getCoprocessorHost() != null) {
      if (user == null) {
        parent.getCoprocessorHost().preSplitAfterPONR();
      } else {
        try {
          user.getUGI().doAs(new PrivilegedExceptionAction<Void>() {
            @Override
            public Void run() throws Exception {
              parent.getCoprocessorHost().preSplitAfterPONR();
              return null;
            }
          });
        } catch (InterruptedException ie) {
          InterruptedIOException iioe = new InterruptedIOException();
          iioe.initCause(ie);
          throw iioe;
        }
      }
    }
    
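    // Injected for this test only: hold the transaction for one hour after the PONR
    // so that the RS can be killed before the daughters are opened.
    try {
      Thread.sleep(1000 * 60 * 60);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }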

    regions = stepsAfterPONR(server, services, regions, user);

    transition(SplitTransactionPhase.COMPLETED);

    return regions;
  }

With this in place, the split transaction hangs after the PONR.

Then I reproduced the problem as follows:

1. Create a test table and move it into a test rsgroup that contains only one RS.

2. Trigger a region split.

3. The split transaction passes the PONR and then sleeps. By this point the region info in meta has already been updated.

4. Kill the RS process to simulate a machine crash.

5. ServerCrashProcedure cleans up the SPLITTING_NEW regions, so the daughter regions are deleted.

6. ServerCrashProcedure tries to assign the parent region. Because the RS is down, the assignment fails, the region state is set to FAILED_OPEN, and the region is put back into regionsInTransition. At this point, because of the RS crash, the region's node under the ZK region-in-transition path no longer exists.

7. The CatalogJanitor thread is blocked because of the RIT.

8. Switch the active master.

9. The CatalogJanitor thread on the new master runs normally and cleans up the parent region, because split = true && offline = true in the meta table.

10. The test table now has a hole, and data is lost.

I modified the code so that when ServerCrashProcedure cleans up the daughter regions, it also updates the parent region info in the meta table. With this change, the problem no longer reproduces.

I will upload the patch later.
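The failure sequence and the proposed change can be illustrated with a small in-memory model. This is only a sketch and not the attached patch: `SplitCrashModel`, `Row`, and every method name are hypothetical, the "meta table" is a plain map, and it assumes that "update the parent regioninfo" means clearing the parent's split and offline flags. It also assumes that an existing daughter row always keeps its parent alive, which is a simplification of what CatalogJanitor checks.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the meta rows involved; hypothetical names, not HBase classes. */
final class SplitCrashModel {
    static final class Row {
        boolean split;    // region is a split parent
        boolean offline;  // region is not being served
        String parent;    // for a daughter: its parent's name; otherwise null
    }

    final Map<String, Row> meta = new HashMap<>();

    void createRegion(String name) {
        meta.put(name, new Row());
    }

    /** Steps 2-3: past the PONR, meta marks the parent split+offline and lists both daughters. */
    void splitPastPonr(String parent) {
        Row p = meta.get(parent);
        p.split = true;
        p.offline = true;
        for (String suffix : new String[] {"-a", "-b"}) {
            Row d = new Row();
            d.parent = parent;
            meta.put(parent + suffix, d);
        }
    }

    /**
     * Step 5: crash handling deletes the daughters. With restoreParent set (the
     * assumed shape of the fix), it also clears the parent's split/offline flags.
     */
    void crashCleanup(String parent, boolean restoreParent) {
        meta.values().removeIf(r -> parent.equals(r.parent));
        if (restoreParent) {
            Row p = meta.get(parent);
            p.split = false;
            p.offline = false;
        }
    }

    /** Step 9: the janitor deletes every split+offline parent that no daughter row refers to. */
    void janitorRun() {
        List<String> doomed = new ArrayList<>();
        for (Map.Entry<String, Row> e : meta.entrySet()) {
            boolean referenced = false;
            for (Row d : meta.values()) {
                if (e.getKey().equals(d.parent)) {
                    referenced = true;
                }
            }
            Row r = e.getValue();
            if (r.split && r.offline && !referenced) {
                doomed.add(e.getKey());
            }
        }
        doomed.forEach(meta::remove);
    }

    /** Runs split, crash cleanup, and one janitor pass; reports whether the parent row remains. */
    static boolean parentSurvives(boolean restoreParent) {
        SplitCrashModel m = new SplitCrashModel();
        m.createRegion("t1");
        m.splitPastPonr("t1");
        m.crashCleanup("t1", restoreParent);
        m.janitorRun();
        return m.meta.containsKey("t1");
    }

    public static void main(String[] args) {
        System.out.println("without fix, parent survives: " + parentSurvives(false));
        System.out.println("with fix, parent survives: " + parentSurvives(true));
    }
}
```

In this model the parent row is deleted when the cleanup leaves its flags untouched, and it survives when the cleanup resets them, which mirrors the behavior reported above.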

Attachments

HBASE-23693.branch-1.001.patch
19/Jan/20 03:16
8 kB
tianhang tang

Issue Links

links to

GitHub Pull Request #1070

GitHub Pull Request #1071

People

Assignee: tianhang tang

Reporter: tianhang tang

Votes: 0

Watchers: 5

Dates

Created: 15/Jan/20 12:15

Updated: 06/Mar/20 23:43

Resolved: 11/Feb/20 15:36