IMPALA-9150

Restarting minicluster breaks HBase on CDH GBN 1582079

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 3.4.0
    • Fix Version/s: Impala 3.4.0
    • Component/s: Infrastructure
    • Labels: None

    Description

      On the most recent CDH GBN (1582079), restarting HBase using our normal scripts (testdata/bin/kill-hbase.sh / testdata/bin/run-hbase.sh) results in an unusable HBase. Our testdata/bin/kill-hbase.sh script uses the kill-java-service.sh script:

      "$DIR"/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2
      

      This kills the region servers before the master. On CDH GBN 1582079, the master gets unhappy:

      19/11/10 16:40:17 INFO master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [localhost,16022,1573402351656]
      19/11/10 16:40:17 INFO master.ServerManager: Processing expiration of localhost,16022,1573402351656 on localhost,16000,1573402349553
      ... same for other region servers ...
      19/11/10 16:40:17 INFO procedure.ServerCrashProcedure: Start pid=102, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, meta=false
      ... same for other region servers ...
      19/11/10 16:40:17 INFO master.SplitLogManager: hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting dir is empty, no logs to split.
      19/11/10 16:40:17 INFO master.SplitLogManager: Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [hdfs://localhost:20500/hbase/WALs/localhost,16023,1573402352683-splitting] in 0ms
      ... more stuff ...
      19/11/10 16:40:17 ERROR procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=102, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=localhost,16022,1573402351656, splitWal=true, meta=false
      java.lang.NullPointerException
          at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createAssignProcedures(AssignmentManager.java:646)
          at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:601)
          at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:571)
          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:188)
          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:59)
          at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
          at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:965)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1742)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)

      Then, when the master starts up again, it remains unhappy:

      19/11/10 16:50:58 WARN master.HMaster: hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, server=localhost,16022,1573402351656}; ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.
      ... more of this ...
      19/11/10 16:59:28 WARN master.HMaster: hbase:namespace,,1573402362931.f310ca3bab11adb03eda8614e9ad980b. is NOT online; state={f310ca3bab11adb03eda8614e9ad980b state=OPEN, ts=1573404657428, server=localhost,16022,1573402351656}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.
      19/11/10 17:05:46 ERROR master.HMaster: Master failed to complete initialization after 900000ms. Please consider submitting a bug report including a thread dump of this process.

      This continues for an indefinite amount of time.
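
      To tell whether a restarted master is in this state, one option is to scan its log for the two messages quoted above. The following is a minimal sketch, not part of the Impala scripts: the function name hbase_master_stuck is hypothetical, and the path to the HMaster log is whatever your minicluster uses.

```shell
# Hypothetical helper (not an existing Impala script).
# Returns 0 if the given HMaster log contains either stuck-startup message
# quoted in this report, 1 if it contains neither, 2 if the log is unreadable.
hbase_master_stuck() {
  local log_file="$1"
  [ -r "$log_file" ] || return 2
  grep -q -e 'Master startup cannot progress' \
          -e 'Master failed to complete initialization' "$log_file"
}
```

      A caller could poll this after run-hbase.sh and fail fast instead of waiting for the 900000ms initialization timeout.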

      Current workaround: Use HBase's bin/stop-hbase.sh script rather than our testdata/bin/kill-hbase.sh script. I do not see the problem when using that script, as it does a more graceful shutdown. We should look into changing testdata/bin/kill-hbase.sh to use bin/stop-hbase.sh.
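
      One possible shape for that change is sketched below. It is an illustration under assumptions, not the fix that was committed: the function name stop_hbase_command is hypothetical, the HBase install directory is passed in by the caller, and the fallback simply reuses the existing kill-java-service.sh invocation shown earlier.

```shell
# Sketch only: chooses how to stop HBase and prints the command.
# $1: HBase install directory (may be empty); $2: directory holding kill-java-service.sh.
# Prefers HBase's own bin/stop-hbase.sh when it is present and executable,
# otherwise falls back to the forceful kill used by testdata/bin/kill-hbase.sh today.
stop_hbase_command() {
  local hbase_home="${1:-}"
  local script_dir="${2:-.}"
  if [ -n "$hbase_home" ] && [ -x "$hbase_home/bin/stop-hbase.sh" ]; then
    echo "$hbase_home/bin/stop-hbase.sh"
  else
    echo "$script_dir/kill-java-service.sh -c HRegionServer -c HMaster -c HQuorumPeer -s 2"
  fi
}
```

      Printing the command rather than running it keeps the choice easy to inspect; a real script would execute the selected command and would likely still need the forceful kill as a last resort if the graceful stop hangs.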

          People

            Assignee: Unassigned
            Reporter: Joe McDonnell (joemcdonnell)