[IMPALA-9150] Restarting minicluster breaks HBase on CDH GBN 1582079 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: Impala 3.4.0
Fix Version/s: Impala 3.4.0
Component/s: Infrastructure
Labels: None
Target Version: Impala 3.4.0

Description

On the most recent CDH GBN (1582079), restarting HBase using our normal scripts (testdata/bin/kill-hbase.sh / testdata/bin/run-hbase.sh) leaves HBase unusable. Our testdata/bin/kill-hbase.sh script uses the kill-java-service.sh script:

"$DIR"/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2

This kills the region servers before the master. On CDH GBN 1582079, the master gets unhappy:

19/11/10 16:40:17 INFO master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [localhost,16022,1573402351656]
19/11/10 16:40:17 INFO master.ServerManager: Processing expiration of localhost,16022,1573402351656 on localhost,16000,1573402349553
... same for other region servers ...
19/11/10 16:40:17 INFO procedure.ServerCrashProcedure: Start pid=102, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, meta=false
... same for other region servers ...
19/11/10 16:40:17 INFO master.SplitLogManager: hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting dir is empty, no logs to split.
19/11/10 16:40:17 INFO master.SplitLogManager: Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting] in 0ms
... more stuff ...
19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, meta=false
java.lang.NullPointerException
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createAssignProcedures(AssignmentManager.java:646)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:601)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:571)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:188)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:59)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1742)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)

Then, when the master starts up again, it remains unhappy:

19/11/10 16:50:58 WARN master.HMaster: hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, server=localhost,16022,1573402351656}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.
... more of this ...
19/11/10 16:59:28 WARN master.HMaster: hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, server=localhost,16022,1573402351656}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.
19/11/10 17:05:46 ERROR master.HMaster: Master failed to complete initialization after 900000ms. Please consider submitting a bug report including a thread dump of this process.

This continues indefinitely.
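One way to spot this state from a script is to search the master log for the initialization-failure message quoted above. This is only a sketch: the function name master_init_failed is hypothetical, and the log file path must be supplied by the caller, since the report does not say where the minicluster writes the HBase master log.

```shell
#!/usr/bin/env bash
# Sketch only: report whether an HBase master log contains the
# "failed to complete initialization" error shown in this issue.
# The function name and the log path argument are assumptions for illustration.
master_init_failed() {
  local log_file="$1"
  # -F: match the message as a fixed string; -q: exit status only, no output.
  grep -qF 'Master failed to complete initialization' "${log_file}"
}
```

A caller could run master_init_failed against the master log after a restart and treat a zero exit status as a sign that the master is stuck in the holding pattern.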

Current workaround: Use HBase's bin/stop-hbase.sh script rather than our testdata/bin/kill-hbase.sh script. I do not see the problem when using that script, as it does a more graceful shutdown. We should look into changing testdata/bin/kill-hbase.sh to use bin/stop-hbase.sh.
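A minimal sketch of what the proposed change could look like: attempt the graceful bin/stop-hbase.sh first and let the caller fall back to the existing kill path if that is unavailable or fails. The function name stop_hbase_gracefully and the way the HBase install directory is passed in are assumptions for illustration, not the change that was actually committed.

```shell
#!/usr/bin/env bash
# Sketch only: try HBase's own stop script for a graceful shutdown.
# The function name and the install-directory argument are hypothetical.
stop_hbase_gracefully() {
  local hbase_home="$1"
  local stop_script="${hbase_home}/bin/stop-hbase.sh"
  if [[ -f "${stop_script}" ]]; then
    # Run the stop script and propagate its exit status to the caller.
    bash "${stop_script}"
    return $?
  fi
  echo "stop-hbase.sh not found under ${hbase_home}" >&2
  return 1
}
```

In testdata/bin/kill-hbase.sh, this could be chained in front of the existing command, for example by calling stop_hbase_gracefully with the HBase install directory and running the current kill-java-service.sh line only when it returns non-zero.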

People

Assignee: Unassigned

Reporter: Joe McDonnell

Votes: 0

Watchers: 2

Dates

Created: 12/Nov/19 18:37

Updated: 23/Nov/19 04:47

Resolved: 13/Nov/19 22:09