HBASE-25650

Reduce MTTR for region server

Details

    • Type: Brainstorming
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4.13
    • Fix Version/s: None
    • Component/s: master, regionserver
    • Labels: None

    Description

      In some cases in our production environment, the machine that runs a Region Server stops functioning properly (I could not ssh to the machine, although it still responded to ping requests). The Region Server process keeps running but cannot process client requests. This lasted for more than 30 minutes, until I manually removed the znode of that Region Server from ZK. The RS was totally unavailable during that time.

      I guess the Region Server still heartbeats to ZK, so ZK does not remove the ephemeral node of the RS and the master does not detect that the RS is down.

       

      I think HBase needs a better failure detection mechanism beyond watching for the existence of the ephemeral node created by the RS.

      One idea that comes to mind is running a failure detection service on the master (for example, the φ Accrual Failure Detector: https://www.semanticscholar.org/paper/The-%CF%86-Accrual-Failure-Detector-Hayashibara-D%C3%A9fago/65c40c79e30c1ef33c97a22a3d52cc9f0415a477) that pings each RS periodically, so that the master learns that an RS is down as soon as possible.
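
      To make the idea concrete, here is a minimal sketch of what such a detector could look like on the master side. It is only an illustration: the class name PhiAccrualDetector and its methods are hypothetical and are not part of HBase. It also simplifies the paper by assuming that the intervals between ping replies are exponentially distributed (the paper fits a normal distribution), which gives phi = elapsed / (mean interval * ln 10).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a phi accrual style failure detector (not HBase code).
 * Assumption: inter-arrival times of ping replies are modelled as exponential,
 * a simplification of the normal distribution used in the paper.
 */
final class PhiAccrualDetector {
    private final int windowSize;                       // max number of intervals kept
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long intervalSumMs = 0;
    private long lastHeartbeatMs = -1;                  // -1 means no reply seen yet

    PhiAccrualDetector(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Record a successful ping reply from the RS, received at nowMs. */
    synchronized void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervalsMs.addLast(interval);
            intervalSumMs += interval;
            if (intervalsMs.size() > windowSize) {
                intervalSumMs -= intervalsMs.removeFirst(); // slide the window
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /** Suspicion level at nowMs; grows linearly with the time since the last reply. */
    synchronized double phi(long nowMs) {
        if (intervalsMs.isEmpty()) {
            return 0.0;                                 // not enough samples to judge
        }
        double meanMs = Math.max(1.0, (double) intervalSumMs / intervalsMs.size());
        double elapsedMs = Math.max(0L, nowMs - lastHeartbeatMs);
        // P(next reply arrives later than elapsed) = exp(-elapsed / mean),
        // and phi = -log10(P) = elapsed / (mean * ln 10).
        return elapsedMs / (meanMs * Math.log(10.0));
    }

    /** True when the suspicion level has reached the given threshold. */
    boolean isSuspected(long nowMs, double threshold) {
        return phi(nowMs) >= threshold;
    }
}
```

      Under these assumptions, with replies arriving once per second and a threshold of 8, the master would start suspecting an RS after roughly 18 seconds of silence (8 * ln 10 * 1 s). The important point is that the pings would have to exercise the same RPC path that client requests use, otherwise a process that can still heartbeat but cannot serve requests would again look healthy.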

       

      Any ideas?

       

       

          People

            Assignee: Unassigned
            Reporter: Jian Wang (synckey)
            Votes: 0
            Watchers: 2
