Description
To mock this case, I added a sleep in SplitTransactionImpl.execute, after the PONR (point of no return) and before openDaughters:
public PairOfSameType<Region> execute(final Server server,
    final RegionServerServices services, User user) throws IOException {
  this.server = server;
  this.rsServices = services;
  useZKForAssignment = server == null ? true :
      ConfigUtil.useZKForAssignment(server.getConfiguration());
  if (useCoordinatedStateManager(server)) {
    std = ((BaseCoordinatedStateManager) server.getCoordinatedStateManager())
        .getSplitTransactionCoordination().getDefaultDetails();
  }
  PairOfSameType<Region> regions = createDaughters(server, services, user);
  if (this.parent.getCoprocessorHost() != null) {
    if (user == null) {
      parent.getCoprocessorHost().preSplitAfterPONR();
    } else {
      try {
        user.getUGI().doAs(new PrivilegedExceptionAction<Void>() {
          @Override
          public Void run() throws Exception {
            parent.getCoprocessorHost().preSplitAfterPONR();
            return null;
          }
        });
      } catch (InterruptedException ie) {
        InterruptedIOException iioe = new InterruptedIOException();
        iioe.initCause(ie);
        throw iioe;
      }
    }
  }
  // sleep here!!!
  try {
    Thread.sleep(1000 * 60 * 60);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
  regions = stepsAfterPONR(server, services, regions, user);
  transition(SplitTransactionPhase.COMPLETED);
  return regions;
}
With this change, the split transaction hangs after the PONR.
I then reproduced the problem as follows:
1. Create a test table and move it into a test rsgroup that contains only one RS.
2. Trigger a region split.
3. The split transaction passes the PONR and then sleeps; the regioninfo in meta has already been updated.
4. Kill the RS process to mock a machine crash.
5. ServerCrashProcedure cleans up the SPLITTING_NEW regions, so the daughter regions are deleted.
6. ServerCrashProcedure tries to assign the parent region. Because the RS is down, the assignment fails, the region state is set to FAILED_OPEN, and the region is put back into regionsInTransition. At this point, because of the RS crash, the region's node under the ZK region-in-transition path no longer exists.
7. The CatalogJanitor thread is blocked because of the RIT.
8. Switch the active master.
9. The CatalogJanitor thread on the new master runs normally and cleans up the parent region, because split = true && offline = true in the meta table.
10. The test table now has a hole and data is lost.
I modified the code so that when ServerCrashProcedure cleans up the daughter regions, it also updates the parent regioninfo in the meta table. With this change, the problem no longer reproduces.
I will upload the patch later.
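To illustrate the sequence and the idea behind the fix, here is a minimal toy model of the meta table state. It is not HBase code and not the patch: `MetaRow`, `metaAfterPonr`, `cleanupDaughters`, and `janitorWouldDeleteParent` are hypothetical names, and the janitor rule is reduced to the split/offline condition from step 9 plus the assumption that the daughters are gone.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the meta table during the failure; none of this is HBase code.
class SplitCleanupSketch {
    // Hypothetical stand-in for the flags kept in a region's meta row.
    static final class MetaRow {
        boolean split;
        boolean offline;

        MetaRow(boolean split, boolean offline) {
            this.split = split;
            this.offline = offline;
        }
    }

    // State after the PONR (step 3): parent is split + offline, daughters are recorded.
    static Map<String, MetaRow> metaAfterPonr() {
        Map<String, MetaRow> meta = new HashMap<>();
        meta.put("parent", new MetaRow(true, true));
        meta.put("daughterA", new MetaRow(false, false));
        meta.put("daughterB", new MetaRow(false, false));
        return meta;
    }

    // Crash cleanup (step 5). With resetParent == true it also clears the parent's
    // split/offline flags; this is an assumption about the shape of the fix.
    static void cleanupDaughters(Map<String, MetaRow> meta, boolean resetParent) {
        meta.remove("daughterA");
        meta.remove("daughterB");
        if (resetParent) {
            MetaRow parent = meta.get("parent");
            parent.split = false;
            parent.offline = false;
        }
    }

    // Simplified janitor rule (step 9): a parent that is split and offline, and
    // whose daughters are gone, looks safe to delete.
    static boolean janitorWouldDeleteParent(Map<String, MetaRow> meta) {
        MetaRow parent = meta.get("parent");
        boolean daughtersGone =
            !meta.containsKey("daughterA") && !meta.containsKey("daughterB");
        return parent != null && parent.split && parent.offline && daughtersGone;
    }

    public static void main(String[] args) {
        Map<String, MetaRow> buggy = metaAfterPonr();
        cleanupDaughters(buggy, false);
        System.out.println("without fix, janitor deletes parent: "
            + janitorWouldDeleteParent(buggy));

        Map<String, MetaRow> fixed = metaAfterPonr();
        cleanupDaughters(fixed, true);
        System.out.println("with fix, janitor deletes parent: "
            + janitorWouldDeleteParent(fixed));
    }
}
```

In this model the first case leaves the parent eligible for deletion even though no daughter serves its key range, which corresponds to the hole in step 10. Resetting the parent's flags during cleanup removes that eligibility.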